Re: [PR] Use Vector API to decode BKD docIds [lucene]

2025-03-04 Thread via GitHub
github-actions[bot] commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2699339952 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-03-04 Thread via GitHub
navneet1v commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2699321614 > @benwtrent @navneet1v I wonder if either of you were able to replicate benchmarks? (FYI I also opened [facebookresearch/faiss#4186](https://github.com/facebookresearch/faiss/pull/418

Re: [I] Stop duplicating per-segment work across segment partitions [lucene]

2025-03-04 Thread via GitHub
javanna commented on issue #13745: URL: https://github.com/apache/lucene/issues/13745#issuecomment-2699189350 > HNSW vector search heavy lifting is done in rewrite, so out of scope for this, right? I believe so, mostly because query rewrite does not parallelize on slices, but across

Re: [I] Stop duplicating per-segment work across segment partitions [lucene]

2025-03-04 Thread via GitHub
msokolov commented on issue #13745: URL: https://github.com/apache/lucene/issues/13745#issuecomment-2698550635 HNSW vector search heavy lifting is done in `rewrite`, so out of scope for this, right? Maybe multi-term queries would need to do some work. What about join queries? TermInSet quer

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-04 Thread via GitHub
benwtrent commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2698684914 @lpld here is my Lucene util changes: https://github.com/mikemccand/luceneutil/pull/348 > What exactly do the numbers in the description of this pull request mean? When you say

Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]

2025-03-04 Thread via GitHub
msokolov commented on issue #13647: URL: https://github.com/apache/lucene/issues/13647#issuecomment-2698449439 I guess this 94GB comes from 33M*768*4 bytes? Frankly I never test with indexes > ~2M docs, but maybe there is a call for the 33M-doc index in nightlies? -- This is an automate
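
The arithmetic in the comment above can be checked with a short sketch. It assumes 33 million documents, 768 dimensions, and 4-byte float32 components, as the comment suggests. The function name `vector_bytes` is made up for illustration and is not part of Lucene or luceneutil.

```python
def vector_bytes(num_docs: int, dims: int, bytes_per_component: int = 4) -> int:
    """Raw size of the vector data alone, ignoring any index overhead."""
    return num_docs * dims * bytes_per_component


total = vector_bytes(33_000_000, 768)
print(total)                    # 101376000000 bytes
print(round(total / 2**30, 1))  # 94.4 (GiB, binary units)
print(round(total / 10**9, 1))  # 101.4 (GB, decimal units)
```

Under these assumptions the 94GB figure corresponds to binary gigabytes (GiB); in decimal units the same raw data is roughly 101 GB.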

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-04 Thread via GitHub
lpld commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2698569925 Hi @benwtrent Thanks again for your previous comment. I was able to modify luceneutil and run some benchmarks. I am quite new to lucene, so I would appreciate some help in understan

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-03-04 Thread via GitHub
msokolov commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2698456606 Maybe as a short-term mitigation we should revert or disable the `connectComponents` impl since its supposed improvements are kind of theoretical and it comes with a deadly vulnera

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-03-04 Thread via GitHub
benwtrent commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2698475184 The goal of connectComponents is to help graphs that have gaps in their connectivity. However, when it's needed most (e.g. tons of gaps and poor connectivity), it does more harm th

Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-03-04 Thread via GitHub
msokolov commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2698452319 I tried indexing some [NOAA climate data](https://www.ncei.noaa.gov/products/land-based-station/noaa-global-temp) that is four-dimensional (temperature over last 150 years for ever

Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]

2025-03-04 Thread via GitHub
dweiss commented on issue #13647: URL: https://github.com/apache/lucene/issues/13647#issuecomment-2698418094 Hmm... 100 gb may be stretching Apache Infra's patience... I don't even know if this bucket has a limit of some sort. -- This is an automated message from the Apache Git Service.

Re: [PR] Use read advice consistently in the knn vector formats [lucene]

2025-03-04 Thread via GitHub
jimczi commented on PR #14076: URL: https://github.com/apache/lucene/pull/14076#issuecomment-2698222473 > I do think some APIs like updateReadAdvice and finishMerge are helpful, I would want to see if we want to keep those and have a noop for this use case. This API was added in [Lucene 10

Re: [PR] Create vectorized versions of ScalarQuantizer.quantize and recalculateCorrectiveOffset [lucene]

2025-03-04 Thread via GitHub
benwtrent commented on PR #14304: URL: https://github.com/apache/lucene/pull/14304#issuecomment-2697969649 I compared this branch with main. There are measurable improvements, but the quantization step isn't the main bottleneck. Vector comparisons still dominate the costs. But it's a nice

Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]

2025-03-04 Thread via GitHub
mikemccand commented on issue #13647: URL: https://github.com/apache/lucene/issues/13647#issuecomment-2697666568 Oooh we have an official S3 bucket to use now? I had already uploaded the benchy corpus files to my own S3 bucket ... I think the URLs are in the setup.py (just renamed to `init

Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]

2025-03-04 Thread via GitHub
mikemccand commented on issue #13647: URL: https://github.com/apache/lucene/issues/13647#issuecomment-2697675763 > [@mikemccand](https://github.com/mikemccand) would you be able to expose the files [@dsmiley](https://github.com/dsmiley) rescued on your server? oh, hmm, no, I haven't y

Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]

2025-03-04 Thread via GitHub
rmuir commented on issue #13647: URL: https://github.com/apache/lucene/issues/13647#issuecomment-2697599407 @dweiss we could fetch https://whimsy.apache.org/public/public_ldap_people.json and retrieve committer's GPG fingerprint that way?

Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-04 Thread via GitHub
benwtrent merged PR #14078: URL: https://github.com/apache/lucene/pull/14078

Re: [I] Writing too many identical vector documents can cause flush blocking [lucene]

2025-03-04 Thread via GitHub
benwtrent commented on issue #14330: URL: https://github.com/apache/lucene/issues/14330#issuecomment-2697394527 This particular (many duplicate vectors) case is handled here: https://github.com/apache/lucene/pull/14215 But the overall issue of connectComponents taking until the "heat

Re: [I] Writing too many identical vector documents can cause flush blocking [lucene]

2025-03-04 Thread via GitHub
benwtrent closed issue #14330: Writing too many identical vector documents can cause flush blocking URL: https://github.com/apache/lucene/issues/14330

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-03-04 Thread via GitHub
kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2697323024 @benwtrent @navneet1v I wonder if either of you were able to replicate benchmarks? (FYI I also opened https://github.com/facebookresearch/faiss/pull/4186 to start publishing the C_AP

Re: [PR] Add a Faiss codec for KNN searches [lucene]

2025-03-04 Thread via GitHub
kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2697320627 Summary of latest changes: 1. Added tests! These will only run if `libfaiss_c.so` (along with all dependencies) is present during runtime (in `$LD_LIBRARY_PATH` or `-Djava.library.pa

[I] Writing too many identical vector documents can cause flush blocking [lucene]

2025-03-04 Thread via GitHub
weizijun opened a new issue, #14330: URL: https://github.com/apache/lucene/issues/14330 ### Description I found a serious bad case. When I write docs that all have the same vector, flush becomes blocked. The cost comes from the `connectComponents` process. When all vectors are the s

Re: [PR] Support load per-iteration replacement of NamedSPI [lucene]

2025-03-04 Thread via GitHub
ChrisHegarty commented on PR #14275: URL: https://github.com/apache/lucene/pull/14275#issuecomment-2697094951 > My proposal would be: Let's add some key-value pairs of "codec options" like done in Analyzers, that can be passed as part of the IndexWriterConfig (while writing) or passed to Di

Re: [PR] Support load per-iteration replacement of NamedSPI [lucene]

2025-03-04 Thread via GitHub
javanna commented on PR #14275: URL: https://github.com/apache/lucene/pull/14275#issuecomment-2697098397 I agree with everything you wrote above @uschindler ! I can try and update my existing PR targeted at suggestion fields (#14270), following your suggested approach. The PR current

Re: [PR] Avoid using time zones that emit warnings (jdk25+) [lucene]

2025-03-04 Thread via GitHub
uschindler commented on PR #14328: URL: https://github.com/apache/lucene/pull/14328#issuecomment-2697079923 Looks fine!

Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]

2025-03-04 Thread via GitHub
dweiss commented on issue #13647: URL: https://github.com/apache/lucene/issues/13647#issuecomment-2696786476 There are two or three references in test files. There is one reference remaining in releaseWizard.py: ``` key_url = "https://home.apache.org/keys/committer/%s.asc" % id.strip(

Re: [PR] Optimize commit retention policy to maintain only the last 5 commits [lucene]

2025-03-04 Thread via GitHub
DivyanshIITB commented on PR #14325: URL: https://github.com/apache/lucene/pull/14325#issuecomment-2696661645 Thank you for your feedback! I understand your concern that KeepOnlyLastCommit might imply retaining only a single commit. My intention behind modifying this policy to retain the la