Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2024-12-17 Thread via GitHub
gaoj0017 commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2550510539 Hi @benwtrent , I am the first author of the [RaBitQ paper](https://arxiv.org/abs/2405.12497) and [its extended version](https://arxiv.org/abs/2409.09913). As your team have known, our

Re: [I] TestSoftDeletesDirectoryReaderWrapper.testAvoidWrappingReadersWithoutSoftDeletes AssertionError: expected:<5> but was:<3> [lucene]

2024-12-17 Thread via GitHub
easyice closed issue #14020: TestSoftDeletesDirectoryReaderWrapper.testAvoidWrappingReadersWithoutSoftDeletes AssertionError: expected:<5> but was:<3> URL: https://github.com/apache/lucene/issues/14020 -- This is an automated message from the Apache Git Service. To respond to the message, pl

Re: [PR] Fix test failure in TestSoftDeletesDirectoryReaderWrapper on expected number of deletes [lucene]

2024-12-17 Thread via GitHub
easyice merged PR #14057: URL: https://github.com/apache/lucene/pull/14057

Re: [PR] Support disabling IndexSearcher.maxClauseCount with a value of -1 [lucene]

2024-12-17 Thread via GitHub
dsmiley commented on PR #13178: URL: https://github.com/apache/lucene/pull/13178#issuecomment-2550388713 Happy to do so.

Re: [PR] Update stopwords.txt [lucene]

2024-12-17 Thread via GitHub
rmuir commented on PR #14075: URL: https://github.com/apache/lucene/pull/14075#issuecomment-2550380846 I took a stab at hacking around this on our side as well: https://github.com/apache/lucene/pull/14079

Re: [PR] Update stopwords.txt [lucene]

2024-12-17 Thread via GitHub
rmuir commented on PR #14075: URL: https://github.com/apache/lucene/pull/14075#issuecomment-2550335515 Thank you @eusousu I tried to make some progress, for now at least I have an open bug report: https://bugs.documentfoundation.org/show_bug.cgi?id=164366

Re: [PR] Use read advice consistently in the knn vector formats [lucene]

2024-12-17 Thread via GitHub
navneet1v commented on code in PR #14076: URL: https://github.com/apache/lucene/pull/14076#discussion_r1889600647 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsFormat.java: ## @@ -78,21 +79,23 @@ public final class Lucene99FlatVectorsFormat extends

Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]

2024-12-17 Thread via GitHub
github-actions[bot] commented on PR #13521: URL: https://github.com/apache/lucene/pull/13521#issuecomment-2549977673 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2024-12-17 Thread via GitHub
github-actions[bot] commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-2549978746 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] Initialize the dictionary of sorted numeric DV with a known size [lucene]

2024-12-17 Thread via GitHub
github-actions[bot] commented on PR #14035: URL: https://github.com/apache/lucene/pull/14035#issuecomment-2549976714 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] This fixes immutability of clauseSets (broken by #13950) [lucene]

2024-12-17 Thread via GitHub
gsmiller commented on PR #14074: URL: https://github.com/apache/lucene/pull/14074#issuecomment-2549785261 Ah great catch. Thanks @uschindler!

Re: [PR] Add a Better Binary Quantizer format for dense vectors [lucene]

2024-12-17 Thread via GitHub
benwtrent closed pull request #13651: Add a Better Binary Quantizer format for dense vectors URL: https://github.com/apache/lucene/pull/13651

Re: [PR] Add a Better Binary Quantizer format for dense vectors [lucene]

2024-12-17 Thread via GitHub
benwtrent commented on PR #13651: URL: https://github.com/apache/lucene/pull/13651#issuecomment-2549771070 Closing this PR in deference to this one: https://github.com/apache/lucene/pull/14078 An evolution of scalar quantization proved more flexible and provided better recall in our

[PR] Binary vector format for flat and hnsw vectors [lucene]

2024-12-17 Thread via GitHub
benwtrent opened a new pull request, #14078: URL: https://github.com/apache/lucene/pull/14078 This provides a binary vector format for vectors. The key ideas are: - Centroid centered vectors - Asymmetric quantization - Individually optimized scalar quantization This all

Re: [PR] aws jmh benchmark cleanups [lucene]

2024-12-17 Thread via GitHub
rmuir commented on PR #14072: URL: https://github.com/apache/lucene/pull/14072#issuecomment-2549721537 @ChrisHegarty take another look?

Re: [PR] aws jmh benchmark cleanups [lucene]

2024-12-17 Thread via GitHub
rmuir commented on PR #14072: URL: https://github.com/apache/lucene/pull/14072#issuecomment-2549694829 Yeah, i commented out the forwarding and I think i reproduced your issue: ``` fatal: [graviton4]: FAILED! => changed: false cmd: /usr/bin/git ls-remote g...@github.c

Re: [PR] Let `DocIdSetIterator` optimize loading into a FixedBitSet. [lucene]

2024-12-17 Thread via GitHub
jpountz merged PR #14069: URL: https://github.com/apache/lucene/pull/14069

Re: [PR] aws jmh benchmark cleanups [lucene]

2024-12-17 Thread via GitHub
rmuir commented on PR #14072: URL: https://github.com/apache/lucene/pull/14072#issuecomment-2549619333 OK, we can change that to https. I'm pretty sure it doesn't work for you because no agent was forwarded. i have in my ssh config: ``` AddKeysToAgent yes ForwardAgent yes ```

Re: [PR] Update stopwords.txt [lucene]

2024-12-17 Thread via GitHub
eusousu commented on PR #14075: URL: https://github.com/apache/lucene/pull/14075#issuecomment-2549438980 > https://github.com/LibreOffice/dictionaries/pull/46 Their contribution process seems to be elsewhere, but I could not understand it fully 😅 I tried sending the issue to the

Re: [PR] Update stopwords.txt [lucene]

2024-12-17 Thread via GitHub
rmuir commented on PR #14075: URL: https://github.com/apache/lucene/pull/14075#issuecomment-2549192152 https://github.com/LibreOffice/dictionaries/pull/46

Re: [PR] Make inlining of FixedBitSet#get more predictable when checking live docs. [lucene]

2024-12-17 Thread via GitHub
jpountz commented on PR #14077: URL: https://github.com/apache/lucene/pull/14077#issuecomment-2549190865 To benchmark this change, I applied a (quick and dirty) patch to luceneutil to have a mix of 3 `Bits` implementations to represent live docs, using a `FixedBitSet` on 75% of segments:

[PR] Make inlining of FixedBitSet#get more predictable when checking live docs. [lucene]

2024-12-17 Thread via GitHub
jpountz opened a new pull request, #14077: URL: https://github.com/apache/lucene/pull/14077 This helps make call sites of `Bits#get` at most bimorphic when checking live docs. This helps because calls to `FixedBitSet#get` can then be inlined when live docs are stored in a `FixedBitSet`. An

Re: [PR] Update stopwords.txt [lucene]

2024-12-17 Thread via GitHub
rmuir commented on PR #14075: URL: https://github.com/apache/lucene/pull/14075#issuecomment-2549163081 I will send them a one-liner PR explaining the situation, we can take it from there. We may want to separately try to be more lenient about this part of the parsing. Have not looked

Re: [PR] Update stopwords.txt [lucene]

2024-12-17 Thread via GitHub
rmuir commented on PR #14075: URL: https://github.com/apache/lucene/pull/14075#issuecomment-2549147744 Yes, that's it. the `REP 3619` should be changed to `REP 3621`. I guess we could send the PR to libreoffice, since the "upstream" dictionary looks totally different here: https://github.co
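The mismatch described in this thread, where the `REP <count>` header in a Hunspell affix file disagrees with the number of `REP <from> <to>` rules that follow, can be checked mechanically. The following is only a minimal sketch under stated assumptions: the class `RepHeaderCheck` and its methods are hypothetical names invented for illustration, not Lucene's actual parser, and the sketch assumes one REP directive per line separated by whitespace.

```java
import java.util.List;

// Hypothetical helper, not part of Lucene: compares the declared REP count
// with the number of REP rule lines actually present.
public class RepHeaderCheck {

  /** Returns the count declared by the first "REP <n>" header line, or -1 if there is none. */
  static int declaredCount(List<String> lines) {
    for (String line : lines) {
      String[] parts = line.trim().split("\\s+");
      if (parts.length == 2 && parts[0].equals("REP")) {
        return Integer.parseInt(parts[1]);
      }
    }
    return -1;
  }

  /** Counts rule lines of the form "REP <from> <to>". */
  static int actualCount(List<String> lines) {
    int count = 0;
    for (String line : lines) {
      String[] parts = line.trim().split("\\s+");
      if (parts.length >= 3 && parts[0].equals("REP")) {
        count++;
      }
    }
    return count;
  }

  /** True when the header count matches the number of rules. */
  static boolean isConsistent(List<String> lines) {
    return declaredCount(lines) == actualCount(lines);
  }
}
```

A strict parser that trusts the header would stop reading rules too early (or too late) when the two numbers differ, which is why a one-line change to the header is enough to fix this kind of failure.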

Re: [PR] Update stopwords.txt [lucene]

2024-12-17 Thread via GitHub
rmuir commented on PR #14075: URL: https://github.com/apache/lucene/pull/14075#issuecomment-2549123863 I'm happy to try to debug this but it might be a few days. The issue may be with the REP rules in the referenced commit. The way these rules work is: ``` REP 3619 REP a а REP c с

Re: [PR] Update stopwords.txt [lucene]

2024-12-17 Thread via GitHub
eusousu commented on PR #14075: URL: https://github.com/apache/lucene/pull/14075#issuecomment-2549064565 > If you were to capture the full "reproduce with" command line that is output by the test framework You mean this? ``` Reproduce with: gradlew :lucene:analysis:common:test -

Re: [PR] Support disabling IndexSearcher.maxClauseCount with a value of -1 [lucene]

2024-12-17 Thread via GitHub
dweiss commented on PR #13178: URL: https://github.com/apache/lucene/pull/13178#issuecomment-2549067219 I just ran into this issue. Do you think we could revisit this and maybe merge it in? I've hit it with a large query consisting of multiple intervals - there are no "clauses" as such and

Re: [PR] Update stopwords.txt [lucene]

2024-12-17 Thread via GitHub
rmuir commented on PR #14075: URL: https://github.com/apache/lucene/pull/14075#issuecomment-2549010607 And yeah i see the commit date, but that's not the push date. So I suspect this issue has nothing to do with your PR and may fail all PRs until we address it.

Re: [PR] Update stopwords.txt [lucene]

2024-12-17 Thread via GitHub
rmuir commented on PR #14075: URL: https://github.com/apache/lucene/pull/14075#issuecomment-2549005107 It fails again. maybe problem comes from https://github.com/LibreOffice/dictionaries/commit/d1696029d8923ae697cb2d6d4d7d69791b1943f2 ?

Re: [PR] Update stopwords.txt [lucene]

2024-12-17 Thread via GitHub
rmuir commented on PR #14075: URL: https://github.com/apache/lucene/pull/14075#issuecomment-2548996704 I reran that check to see what happens, if it reproduces. This "extra regressions" check is also doing unpinned shallow `git clone` of external dictionaries repositories, so they co

Re: [PR] Update stopwords.txt [lucene]

2024-12-17 Thread via GitHub
msokolov commented on PR #14075: URL: https://github.com/apache/lucene/pull/14075#issuecomment-2548911336 > I am unable to understand how my change could impact the analysis on the Mongolian language. I don't understand the connection to your change either, but it looks to me as if M

Re: [PR] Use read advice consistently in the knn vector formats [lucene]

2024-12-17 Thread via GitHub
jimczi commented on code in PR #14076: URL: https://github.com/apache/lucene/pull/14076#discussion_r1888721064 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsWriter.java: ## @@ -282,7 +285,7 @@ public CloseableRandomVectorScorerSupplier mergeOneFie

Re: [PR] Use read advice consistently in the knn vector formats [lucene]

2024-12-17 Thread via GitHub
ChrisHegarty commented on code in PR #14076: URL: https://github.com/apache/lucene/pull/14076#discussion_r1888686757 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsWriter.java: ## @@ -282,7 +285,7 @@ public CloseableRandomVectorScorerSupplier merge

Re: [PR] This fixes immutability of clauseSets (broken by #13950) [lucene]

2024-12-17 Thread via GitHub
uschindler commented on PR #14074: URL: https://github.com/apache/lucene/pull/14074#issuecomment-2548724672 > > For the future: If people submit PRs about making private members which are collections public, always check if the immutability could be violated. This is a major problem in Java

Re: [PR] Update stopwords.txt [lucene]

2024-12-17 Thread via GitHub
eusousu commented on PR #14075: URL: https://github.com/apache/lucene/pull/14075#issuecomment-2548665777 Is there something I can investigate further? I tried parsing the error and it seems to relate to a mn_MN dictionary that refers to Mongolian 🤔 ``` While checking /home/runner/

Re: [PR] This fixes immutability of clauseSets (broken by #13950) [lucene]

2024-12-17 Thread via GitHub
mikemccand commented on PR #14074: URL: https://github.com/apache/lucene/pull/14074#issuecomment-2548665493 > For the future: If people submit PRs about making private members which are collections public, always check if the immutability could be violated. This is a major problem in Java w

Re: [PR] Allow reading binary doc values as a RandomAccessInput [lucene]

2024-12-17 Thread via GitHub
jpountz commented on code in PR #13948: URL: https://github.com/apache/lucene/pull/13948#discussion_r188862 ## lucene/core/src/test/org/apache/lucene/util/TestBytesRefArray.java: ## @@ -43,8 +44,17 @@ public void testAppend() throws IOException { for (int i = 0; i < e

Re: [PR] Update stopwords.txt [lucene]

2024-12-17 Thread via GitHub
benwtrent commented on PR #14075: URL: https://github.com/apache/lucene/pull/14075#issuecomment-2548624976 While it's a simple change, it does change the analysis chain. I wonder if it should stick to Lucene 11 (admittedly, that will not be shipped for a LONG time). I wonder what othe

Re: [PR] This fixes immutability of clauseSets (broken by #13950) [lucene]

2024-12-17 Thread via GitHub
uschindler commented on PR #14074: URL: https://github.com/apache/lucene/pull/14074#issuecomment-2548529464 I cherrypicked in 10.x and 10.1 branch.

Re: [PR] This fixes immutability of clauseSets (broken by #13950) [lucene]

2024-12-17 Thread via GitHub
uschindler merged PR #14074: URL: https://github.com/apache/lucene/pull/14074

[PR] Use read advice consistently in the knn vector formats [lucene]

2024-12-17 Thread via GitHub
jimczi opened a new pull request, #14076: URL: https://github.com/apache/lucene/pull/14076 This change reverts #13985 and makes sure each knn format sticks to a single read advice consistently. Switching read advice during merges might help some use cases, but it can also hurt others—e.

[PR] Update stopwords.txt [lucene]

2024-12-17 Thread via GitHub
eusousu opened a new pull request, #14075: URL: https://github.com/apache/lucene/pull/14075 ### Description In Brazilian Portuguese the conjunction "em(preposition)+(article)" takes the forms "na, nas, no, nos", which are common stop words. For some reason the "nas" conjunction appea

Re: [PR] This fixes immutability of clauseSets (broken by #13950) [lucene]

2024-12-17 Thread via GitHub
uschindler commented on PR #14074: URL: https://github.com/apache/lucene/pull/14074#issuecomment-2548460187 P.S.: I added the `Collections.unmodifiableCollection` only in the getter, because making `EnumMap clauseSets` does not work: - The inner values of the map are custom classes. As we ha
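The general pattern discussed in this thread, keeping an internal mutable collection private and exposing only a read-only view from the getter, can be sketched as follows. This is an illustrative sketch only, not the code from the PR: the class `ClauseHolder`, its `Occur` enum, and the `String` element type are hypothetical stand-ins, and the only real API used is the JDK's `Collections.unmodifiableCollection`.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class, not Lucene's BooleanQuery.
public class ClauseHolder {
  public enum Occur { MUST, SHOULD }

  // Internal state stays mutable and private.
  private final Map<Occur, Collection<String>> clauseSets = new EnumMap<>(Occur.class);

  public ClauseHolder(List<String> must, List<String> should) {
    // Defensive copies so later changes to the caller's lists do not leak in.
    clauseSets.put(Occur.MUST, new ArrayList<>(must));
    clauseSets.put(Occur.SHOULD, new ArrayList<>(should));
  }

  /** Returns a read-only view; attempts to modify it throw UnsupportedOperationException. */
  public Collection<String> getClauses(Occur occur) {
    return Collections.unmodifiableCollection(clauseSets.get(occur));
  }
}
```

Wrapping at the getter keeps the cost to one small wrapper object per call and leaves the internal representation free to change, at the price of relying on every public accessor to remember the wrapper.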

Re: [I] Missing word on Brazillian stop word list [lucene]

2024-12-17 Thread via GitHub
eusousu commented on issue #14065: URL: https://github.com/apache/lucene/issues/14065#issuecomment-2548452970 It's included on the Portuguese [stopwords.txt](https://github.com/apache/lucene/blob/0203815/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/portuguese_stop

Re: [PR] Make some BooleanQuery methods public and a new `#add(Collection)` method for BQ builder [lucene]

2024-12-17 Thread via GitHub
uschindler commented on code in PR #13950: URL: https://github.com/apache/lucene/pull/13950#discussion_r1888513047 ## lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java: ## @@ -136,7 +152,7 @@ public List clauses() { } /** Return the collection of queries fo

Re: [I] debug what happened with 14031 [lucene]

2024-12-17 Thread via GitHub
rmuir commented on issue #14042: URL: https://github.com/apache/lucene/issues/14042#issuecomment-2548442349 I think we are having a communication issue over terminology. I don't care about unrolling, i care about superscalar execution. JVM doesn't allow it, which means the hardware sits the

[PR] Add comment to the SIMD intersection logic. [lucene]

2024-12-17 Thread via GitHub
jpountz opened a new pull request, #14073: URL: https://github.com/apache/lucene/pull/14073 I did not know it when I checked in the code, but this is almost exactly the v1 intersection algorithm from the "SIMD compression and the intersection of sorted integers" paper.

Re: [PR] Make some BooleanQuery methods public and a new `#add(Collection)` method for BQ builder [lucene]

2024-12-17 Thread via GitHub
uschindler commented on code in PR #13950: URL: https://github.com/apache/lucene/pull/13950#discussion_r1888434442 ## lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java: ## @@ -136,7 +152,7 @@ public List clauses() { } /** Return the collection of queries fo

Re: [PR] Make some BooleanQuery methods public and a new `#add(Collection)` method for BQ builder [lucene]

2024-12-17 Thread via GitHub
uschindler commented on code in PR #13950: URL: https://github.com/apache/lucene/pull/13950#discussion_r1888404168 ## lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java: ## @@ -136,7 +152,7 @@ public List clauses() { } /** Return the collection of queries fo

Re: [PR] Make some BooleanQuery methods public and a new `#add(Collection)` method for BQ builder [lucene]

2024-12-17 Thread via GitHub
uschindler commented on code in PR #13950: URL: https://github.com/apache/lucene/pull/13950#discussion_r1888387748 ## lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java: ## @@ -136,7 +152,7 @@ public List clauses() { } /** Return the collection of queries fo
