Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]

2025-03-03 Thread via GitHub
dweiss commented on issue #13647: URL: https://github.com/apache/lucene/issues/13647#issuecomment-2696384821 I can generate this file and make it available as a benchmark dataset. Or would you rather give me one of your own, for consistency with your previous results? -- This is an a

Re: [I] Improve documentation for org.apache.lucene.search Sort class [lucene]

2025-03-03 Thread via GitHub
msokolov commented on issue #14295: URL: https://github.com/apache/lucene/issues/14295#issuecomment-2695731447 thanks for pointing that out, somehow I overlooked it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the U

Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]

2025-03-03 Thread via GitHub
msokolov commented on issue #13647: URL: https://github.com/apache/lucene/issues/13647#issuecomment-2695713269 Yes, I was referring to files that can be generated with `infer_token_vectors_cohere.py`. Maybe we take the position that users should regenerate, but it is kind of slow and demand

Re: [PR] Support load per-iteration replacement of NamedSPI [lucene]

2025-03-03 Thread via GitHub
uschindler commented on PR #14275: URL: https://github.com/apache/lucene/pull/14275#issuecomment-2695637219 Hi, > I will throw in a real usecase that gives us a bit of headache: completion fields. All the existing codecs load them on heap, and we want to make a switch to load them of

Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]

2025-03-03 Thread via GitHub
dweiss commented on issue #13647: URL: https://github.com/apache/lucene/issues/13647#issuecomment-2695553260 > [...] but can we attach 3G files here? I think we can, if it makes sense to do so. We're not supposed to abuse this service - for example by downloading 3gb data file

Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]

2025-03-03 Thread via GitHub
msokolov commented on issue #13647: URL: https://github.com/apache/lucene/issues/13647#issuecomment-2695517144 There are other vector data files - I think the key one that has become a reference point is Cohere 768d trained on wikipedia-derived docs, but I'm not sure where nightly benchmark

Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]

2025-03-03 Thread via GitHub
benwtrent commented on issue #13647: URL: https://github.com/apache/lucene/issues/13647#issuecomment-2695529583 @msokolov the python script in Lucene util downloads from hugging face. Is that the data you are talking about? `infer_token_vectors_cohere.py`

Re: [I] Flaky `TestKnnByteVectorQueryMMap.testRandomWithFilter` test failures [lucene]

2025-03-03 Thread via GitHub
benwtrent closed issue #14266: Flaky `TestKnnByteVectorQueryMMap.testRandomWithFilter` test failures URL: https://github.com/apache/lucene/issues/14266

Re: [PR] Remove some randomness in testRandomWithFilter [lucene]

2025-03-03 Thread via GitHub
benwtrent merged PR #14329: URL: https://github.com/apache/lucene/pull/14329

Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]

2025-03-03 Thread via GitHub
dweiss commented on issue #13647: URL: https://github.com/apache/lucene/issues/13647#issuecomment-2695433674 We now have an s3 bucket to place those benchmark/ reference files on. If you have any of these files - please let me know and perhaps make it available to me, somehow -

[PR] Remove some randomness in testRandomWithFilter [lucene]

2025-03-03 Thread via GitHub
benwtrent opened a new pull request, #14329: URL: https://github.com/apache/lucene/pull/14329 I have noticed some rare failures of this test, but every time it failed, it was due to a valid set of kNN docs being found before the exploration limit was actually hit. This is due to extremely l

Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]

2025-03-03 Thread via GitHub
rmuir commented on issue #13647: URL: https://github.com/apache/lucene/issues/13647#issuecomment-2695438823 @dweiss https://issues.apache.org/jira/secure/attachment/12429835/top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2

Re: [PR] Avoid using time zones that emit warnings (jdk25+) [lucene]

2025-03-03 Thread via GitHub
dweiss merged PR #14328: URL: https://github.com/apache/lucene/pull/14328

Re: [PR] Support load per-iteration replacement of NamedSPI [lucene]

2025-03-03 Thread via GitHub
javanna commented on PR #14275: URL: https://github.com/apache/lucene/pull/14275#issuecomment-2695424434 I will throw in a real usecase that gives us a bit of headache: completion fields. All the existing codecs load them on heap, and we want to make a switch to load them off heap in certai

Re: [I] :lucene:benchmark:getGeoNames github job fails [lucene]

2025-03-03 Thread via GitHub
dweiss closed issue #14144: :lucene:benchmark:getGeoNames github job fails URL: https://github.com/apache/lucene/issues/14144

[PR] Avoid using time zones that emit warnings (jdk25+) [lucene]

2025-03-03 Thread via GitHub
dweiss opened a new pull request, #14328: URL: https://github.com/apache/lucene/pull/14328 This causes tests that expect exact outputs (like TestReproduceMessage) to occasionally fail under JDK25+. I added some filtering to randomTimeZone so that those warning-emitting time zone codes are n
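
The filtering described above could look roughly like the following minimal sketch. The snippet is truncated and does not show the actual change, so everything here is an assumption: the class and method names are hypothetical, and the sketch assumes the warning-emitting codes are the legacy short IDs listed in `java.time.ZoneId.SHORT_IDS`.

```java
import java.time.ZoneId;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.TimeZone;
import java.util.stream.Collectors;

// Hypothetical helper, not the code from the PR.
public class SafeRandomTimeZone {

  /**
   * Returns all available zone IDs except the legacy short aliases.
   * Assumption: those aliases are the IDs that trigger the warning.
   * The list is sorted so that a fixed seed always selects the same zone.
   */
  static List<String> safeZoneIds() {
    return Arrays.stream(TimeZone.getAvailableIDs())
        .filter(id -> !ZoneId.SHORT_IDS.containsKey(id))
        .sorted()
        .collect(Collectors.toList());
  }

  /** Picks a pseudo-random time zone from the filtered list. */
  static TimeZone randomTimeZone(Random random) {
    List<String> ids = safeZoneIds();
    return TimeZone.getTimeZone(ids.get(random.nextInt(ids.size())));
  }
}
```

Sorting before the random pick keeps the choice reproducible for a given seed, which matters for tests that compare exact output.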

Re: [PR] introduce new parameter onlyLongestMatchNoSubwords replacing onlyLongestMatch [lucene]

2025-03-03 Thread via GitHub
renatoh commented on PR #14311: URL: https://github.com/apache/lucene/pull/14311#issuecomment-2695297490 @rmuir the field is on the superclass, hence we cannot deprecate it. We could deprecate the current constructor and introduce another constructor without the onlyLongest

Re: [PR] Use read advice consistently in the knn vector formats [lucene]

2025-03-03 Thread via GitHub
shatejas commented on PR #14076: URL: https://github.com/apache/lucene/pull/14076#issuecomment-2695220393 > For the exact case that was tested in https://github.com/apache/lucene/pull/13985 that might be a regression. I am challenging the result a bit here since I don't see how the copy tim

Re: [PR] Reuse entry point scores and provide mechanisms to provide scores for directly entry points [lucene]

2025-03-03 Thread via GitHub
benwtrent commented on PR #14256: URL: https://github.com/apache/lucene/pull/14256#issuecomment-2694997840 This proved not particularly useful. Maybe it can be a future optimization, but for now, it seems the added complexity isn't worth its cost.

Re: [PR] Reuse entry point scores and provide mechanisms to provide scores for directly entry points [lucene]

2025-03-03 Thread via GitHub
benwtrent closed pull request #14256: Reuse entry point scores and provide mechanisms to provide scores for directly entry points URL: https://github.com/apache/lucene/pull/14256

Re: [PR] Support load per-iteration replacement of NamedSPI [lucene]

2025-03-03 Thread via GitHub
ChrisHegarty commented on PR #14275: URL: https://github.com/apache/lucene/pull/14275#issuecomment-2694804577 Yeah, I can see your point. What unsettles me a little about the proposed change is the "weight" that it imposes on this simple SPI interface for a somewhat niche issue. That said,

Re: [PR] Use read advice consistently in the knn vector formats [lucene]

2025-03-03 Thread via GitHub
jimczi commented on PR #14076: URL: https://github.com/apache/lucene/pull/14076#issuecomment-2694779286 > Is that truly the case or did I miss something? That's probably the opposite. For the exact case that was tested in https://github.com/apache/lucene/pull/13985 that might be a re

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-03-03 Thread via GitHub
msokolov commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2694771977 see https://github.com/mikemccand/luceneutil/pull/345 for benchmarking support

Re: [PR] Use read advice consistently in the knn vector formats [lucene]

2025-03-03 Thread via GitHub
msokolov commented on PR #14076: URL: https://github.com/apache/lucene/pull/14076#issuecomment-2694758015 I briefly skimmed the prior PR, which this effectively undoes, and I did not see much benefit there in terms of improving merge times. Is that truly the case or did I miss something? If

Re: [PR] Support load per-iteration replacement of NamedSPI [lucene]

2025-03-03 Thread via GitHub
uschindler commented on PR #14275: URL: https://github.com/apache/lucene/pull/14275#issuecomment-2694748030 But basically the whole idea here is to allow replacing codecs, which in reality is not wanted at all. If you want a different codec, name it differently. So you am not fully h

Re: [PR] Support load per-iteration replacement of NamedSPI [lucene]

2025-03-03 Thread via GitHub
uschindler commented on PR #14275: URL: https://github.com/apache/lucene/pull/14275#issuecomment-2694739987 Now I understand how this is expected to work. Elasticsearch will return true for the SPI impl. Maybe we should also try to allow different orders for the active discovery. May

Re: [PR] Support load per-iteration replacement of NamedSPI [lucene]

2025-03-03 Thread via GitHub
uschindler commented on PR #14275: URL: https://github.com/apache/lucene/pull/14275#issuecomment-2694709538 Did you think about this one, too: https://github.com/apache/lucene/blob/8e68ed22614dc7841ebea94d3e66561ceb74d25e/lucene/core/src/java/org/apache/lucene/analysis/AnalysisSPILoader.java

Re: [PR] Support load per-iteration replacement of NamedSPI [lucene]

2025-03-03 Thread via GitHub
ChrisHegarty commented on PR #14275: URL: https://github.com/apache/lucene/pull/14275#issuecomment-2694681847 Anyone else ? @uschindler ? I think that this is quite solid, and while the issue we're facing in Elasticsearch is because we deploy as modules, it may not be that widely encounter

Re: [PR] Utility classes to make it easier to use sandbox facet API for most common cases [lucene]

2025-03-03 Thread via GitHub
stefanvodita merged PR #14237: URL: https://github.com/apache/lucene/pull/14237

Re: [PR] introduce new parameter onlyLongestMatchNoSubwords replacing onlyLongestMatch [lucene]

2025-03-03 Thread via GitHub
rmuir commented on PR #14311: URL: https://github.com/apache/lucene/pull/14311#issuecomment-2694568500 @renatoh oh, sorry for the slow feedback, did not realize you had deconflicted it. changes look good to me. I wish there was a way to really use a `@deprecated/@Deprecated` here, bu

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-03-03 Thread via GitHub
msokolov commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2694557420 > ^ I can also help playing around with the above ideas and benchmark to see if it helps (some cases above seems to have high reentries, ~500) Please feel free to experiment! Not

Re: [PR] introduce new parameter onlyLongestMatchNoSubwords replacing onlyLongestMatch [lucene]

2025-03-03 Thread via GitHub
renatoh commented on PR #14311: URL: https://github.com/apache/lucene/pull/14311#issuecomment-2694511860 @rmuir any thoughts on my changes?

Re: [PR] Optimize commit retention policy to maintain only the last 5 commits [lucene]

2025-03-03 Thread via GitHub
rmuir commented on PR #14325: URL: https://github.com/apache/lucene/pull/14325#issuecomment-2694353271 To me, KeepOnlyLastCommit means only the last commit, not the last 5. I don't think this policy should be modified like this.
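
To illustrate the distinction being drawn, here is a minimal, self-contained model of a "keep the last N commits" rule as a separate, explicitly named policy. It is only a sketch under stated assumptions: `Commit` is a hypothetical stand-in and not Lucene's `IndexCommit`, `keepLast` is an invented name, and the list is assumed to be ordered oldest first. With `n = 1` it behaves like a keep-only-last policy; a larger `n` is a different policy and arguably deserves its own name.

```java
import java.util.List;

// Hypothetical model, not Lucene code and not the code from the PR.
public class KeepLastNSketch {

  /** Stand-in for a commit point; only tracks whether it was marked for deletion. */
  static final class Commit {
    final long generation;
    boolean deleted;

    Commit(long generation) {
      this.generation = generation;
    }
  }

  /**
   * Marks every commit except the newest n as deleted.
   * Assumes the list is ordered from oldest to newest.
   */
  static void keepLast(List<Commit> commits, int n) {
    if (n < 1) {
      throw new IllegalArgumentException("n must be >= 1");
    }
    int firstKept = Math.max(0, commits.size() - n);
    for (int i = 0; i < firstKept; i++) {
      commits.get(i).deleted = true;
    }
  }
}
```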

Re: [PR] Fix DirectIOIndexInput seek to not read when position is within buffer [lucene]

2025-03-03 Thread via GitHub
ChrisHegarty merged PR #14320: URL: https://github.com/apache/lucene/pull/14320

Re: [PR] Make Lucene better at skipping long runs of matches. [lucene]

2025-03-03 Thread via GitHub
gf2121 commented on code in PR #14312: URL: https://github.com/apache/lucene/pull/14312#discussion_r1976952412 ## lucene/core/src/java/org/apache/lucene/search/DenseConjunctionBulkScorer.java: ## @@ -128,6 +128,16 @@ private void scoreWindowUsingBitSet( assert windowMatches

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-03-03 Thread via GitHub
dungba88 commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2694032117 > - Could we readjust the pro-rata rate, not based on the whole index, but based on the effective segments? - What if we just set the per-leaf k to the same as global k in
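
The first question above concerns the denominator of a pro-rata split of the global k across segments. The sketch below shows a plain proportional split only; it is not the formula used in the PR, which the snippet does not show. `perLeafK` and its parameters are hypothetical names, and "effective segments" is read here as the subset of segments that can contribute matches.

```java
// Hypothetical helper, not the code from the PR.
public class ProRataK {

  /**
   * Proportional share of the global k for one segment, rounded up.
   * totalDocs is the chosen denominator: the whole index, or only the
   * documents in the segments considered effective.
   */
  static int perLeafK(int k, int leafDocs, int totalDocs) {
    if (k < 1 || leafDocs < 0 || totalDocs < 1 || leafDocs > totalDocs) {
      throw new IllegalArgumentException("invalid arguments");
    }
    if (leafDocs == 0) {
      return 0; // an empty segment contributes nothing
    }
    // Ceiling division in long arithmetic to avoid int overflow.
    long scaled = ((long) k * leafDocs + totalDocs - 1) / totalDocs;
    // Clamp to [1, k] so a tiny segment still collects at least one hit.
    return (int) Math.min(k, Math.max(1, scaled));
  }
}
```

With k = 100 and a 250-document segment, a denominator of 1000 (whole index) gives 25, while a denominator of 500 (effective segments only) gives 50, so shrinking the denominator raises each segment's share.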

Re: [I] Create a bot to check if there is a CHANGES entry for new PRs [lucene]

2025-03-03 Thread via GitHub
pseudo-nymous commented on issue #13898: URL: https://github.com/apache/lucene/issues/13898#issuecomment-2693761988 Thanks for updating the script with PR number. I'll root cause and fix the issue.

Re: [PR] OptimisticKnnVectorQuery [lucene]

2025-03-03 Thread via GitHub
dungba88 commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2693679959 > I added an additional cap on this, but then realized we are already implicitly imposing such a limit here: @msokolov that checks if the *previous* iteration (kInLoop / 2) has ex