benwtrent closed issue #14330: Writing too many identical vector documents can
cause flush blocking
URL: https://github.com/apache/lucene/issues/14330
mikemccand commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2697675763
> [@mikemccand](https://github.com/mikemccand) would you be able to expose
the files [@dsmiley](https://github.com/dsmiley) rescued on your server?
oh, hmm, no I haven't y
benwtrent commented on PR #14304:
URL: https://github.com/apache/lucene/pull/14304#issuecomment-2697969649
I compared this branch with main. There are measurable improvements, but the
quantization step isn't the main bottleneck. Vector comparisons still dominate
the costs. But it's a nice
benwtrent merged PR #14078:
URL: https://github.com/apache/lucene/pull/14078
rmuir commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2697599407
@dweiss we could fetch
https://whimsy.apache.org/public/public_ldap_people.json and retrieve
committer's GPG fingerprint that way?
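A minimal sketch of that lookup, assuming the JSON nests per-committer records under a "people" key with a "key_fingerprints" array (field names unverified), and using Jackson for parsing:
```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FingerprintLookup {
  public static void main(String[] args) throws Exception {
    // Fetch the public LDAP dump published by Whimsy.
    HttpRequest request = HttpRequest.newBuilder(
        URI.create("https://whimsy.apache.org/public/public_ldap_people.json")).build();
    String body = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString()).body();

    // Assumed layout: {"people": {"<apache id>": {"key_fingerprints": [...], ...}}}
    JsonNode person = new ObjectMapper().readTree(body).path("people").path(args[0]);
    for (JsonNode fp : person.path("key_fingerprints")) {
      System.out.println(fp.asText());
    }
  }
}
```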
mikemccand commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2697666568
Oooh we have an official S3 bucket to use now? I had already uploaded the
benchy corpus files to my own S3 bucket ... I think the URLs are in the
setup.py (just renamed to `init
benwtrent commented on issue #14330:
URL: https://github.com/apache/lucene/issues/14330#issuecomment-2697394527
This particular (many duplicate vectors) case is handled here:
https://github.com/apache/lucene/pull/14215
But the overall issue of connectComponents taking until the "heat
msokolov commented on issue #13745:
URL: https://github.com/apache/lucene/issues/13745#issuecomment-2698550635
HNSW vector search heavy lifting is done in `rewrite`, so out of scope for
this, right? Maybe multi-term queries would need to do some work. What about
join queries? TermInSet quer
msokolov commented on issue #14214:
URL: https://github.com/apache/lucene/issues/14214#issuecomment-2698456606
Maybe as a short-term mitigation we should revert or disable the
`connectComponents` impl since its supposed improvements are kind of
theoretical and it comes with a deadly vulnera
msokolov commented on issue #14214:
URL: https://github.com/apache/lucene/issues/14214#issuecomment-2698452319
I tried indexing some [NOAA climate
data](https://www.ncei.noaa.gov/products/land-based-station/noaa-global-temp)
that is four-dimensional (temperature over last 150 years for ever
lpld commented on PR #14078:
URL: https://github.com/apache/lucene/pull/14078#issuecomment-2698569925
Hi @benwtrent
Thanks again for your previous comment. I was able to modify luceneutil and
run some benchmarks. I am quite new to Lucene, so I would appreciate some help
in understan
dweiss commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2698418094
Hmm... 100 GB may be stretching Apache Infra's patience... I don't even know
if this bucket has a limit of some sort.
jimczi commented on PR #14076:
URL: https://github.com/apache/lucene/pull/14076#issuecomment-2698222473
> I do think some APIs like updateReadAdvice and finishMerge are helpful, I
would want to see if we want to keep those and have a noop for this use case.
This API was added in [Lucene 10
github-actions[bot] commented on PR #14203:
URL: https://github.com/apache/lucene/pull/14203#issuecomment-2699339952
This PR has not had activity in the past 2 weeks, labeling it as stale. If
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you
for your contributi
javanna commented on issue #13745:
URL: https://github.com/apache/lucene/issues/13745#issuecomment-2699189350
> HNSW vector search heavy lifting is done in rewrite, so out of scope for
this, right?
I believe so, mostly because query rewrite does not parallelize on slices,
but across
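For context, a rough sketch of the distinction being drawn (hypothetical `dir` and `query`; an executor-backed IndexSearcher fans collection out across leaf slices, while `rewrite` runs up front, which for kNN queries is where the graph search happens):
```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

public class RewriteVsCollect {
  static TopDocs run(Directory dir, Query query) throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(4);
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader, executor);
      // rewrite happens up front; for kNN queries this is where the
      // vector search heavy lifting runs, outside the per-slice fan-out
      Query rewritten = searcher.rewrite(query);
      // collection is what gets distributed across leaf slices
      return searcher.search(rewritten, 10);
    } finally {
      executor.shutdown();
    }
  }
}
```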
navneet1v commented on PR #14178:
URL: https://github.com/apache/lucene/pull/14178#issuecomment-2699321614
> @benwtrent @navneet1v I wonder if either of you were able to replicate
benchmarks? (FYI I also opened
[facebookresearch/faiss#4186](https://github.com/facebookresearch/faiss/pull/418
uschindler commented on PR #14328:
URL: https://github.com/apache/lucene/pull/14328#issuecomment-2697079923
Looks fine!
dweiss commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2696786476
There are two or three references in test files. There is one reference
remaining in releaseWizard.py:
```
key_url = "https://home.apache.org/keys/committer/%s.asc" % id.strip()
```
DivyanshIITB commented on PR #14325:
URL: https://github.com/apache/lucene/pull/14325#issuecomment-2696661645
Thank you for your feedback! I understand your concern that
KeepOnlyLastCommit might imply retaining only a single commit. My intention
behind modifying this policy to retain the la
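Not the code under discussion in this PR, but as a sketch of the general pattern, a custom IndexDeletionPolicy that retains the newest N commits could look like this:
```java
import java.util.List;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

// Sketch only: keep the newest n commits, delete everything older.
public class KeepLastNCommitsPolicy extends IndexDeletionPolicy {
  private final int n;

  public KeepLastNCommitsPolicy(int n) {
    this.n = n;
  }

  @Override
  public void onInit(List<? extends IndexCommit> commits) {
    onCommit(commits);
  }

  @Override
  public void onCommit(List<? extends IndexCommit> commits) {
    // Commits are ordered oldest to newest; delete all but the last n.
    for (int i = 0; i < commits.size() - n; i++) {
      commits.get(i).delete();
    }
  }
}
```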
javanna commented on PR #14275:
URL: https://github.com/apache/lucene/pull/14275#issuecomment-2697098397
I agree with everything you wrote above, @uschindler!
I can try and update my existing PR targeted at suggestion fields (#14270),
following your suggested approach. The PR current
ChrisHegarty commented on PR #14275:
URL: https://github.com/apache/lucene/pull/14275#issuecomment-2697094951
> My proposal would be: Let's add some key-value pairs of "codec options"
like done in Analyzers, that can be passed as part of the IndexWriterConfig
(while writing) or passed to Di
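To make the shape of that proposal concrete, a purely hypothetical sketch (no `setCodecOption` exists in Lucene today; `analyzer` and `dir` are placeholders):
```java
// Hypothetical API, shown only to illustrate the key-value proposal above.
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setCodecOption("vectors.quantization", "int7");  // hypothetical option key/value
iwc.setCodecOption("postings.blockSize", "128");     // hypothetical option key/value
IndexWriter writer = new IndexWriter(dir, iwc);
```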
weizijun opened a new issue, #14330:
URL: https://github.com/apache/lucene/issues/14330
### Description
I found a serious bad case: writing documents that all contain the same
vector causes the flush to block.
The cost comes from the `connectComponents` process.
When all vectors are the s
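A minimal sketch reproducing the described pattern (ByteBuffersDirectory, 128 dimensions, and 100k docs are placeholder choices, not the reporter's setup):
```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class DuplicateVectorRepro {
  public static void main(String[] args) throws Exception {
    float[] same = new float[128]; // every document gets this identical vector
    same[0] = 1f;
    try (Directory dir = new ByteBuffersDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      for (int i = 0; i < 100_000; i++) {
        Document doc = new Document();
        doc.add(new KnnFloatVectorField("vec", same));
        writer.addDocument(doc);
      }
      writer.commit(); // the flush here is where connectComponents was reported to stall
    }
  }
}
```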
kaivalnp commented on PR #14178:
URL: https://github.com/apache/lucene/pull/14178#issuecomment-2697323024
@benwtrent @navneet1v I wonder if either of you were able to replicate
benchmarks?
(FYI I also opened https://github.com/facebookresearch/faiss/pull/4186 to
start publishing the C_AP
kaivalnp commented on PR #14178:
URL: https://github.com/apache/lucene/pull/14178#issuecomment-2697320627
Summary of latest changes:
1. Added tests! These will only run if `libfaiss_c.so` (along with all
dependencies) is present during runtime (in `$LD_LIBRARY_PATH` or
`-Djava.library.pa
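The PR has its own mechanism for this; purely as an illustration of the idea, one way to gate JUnit 4 tests on a native library being resolvable at runtime:
```java
import org.junit.Assume;
import org.junit.Before;

public class FaissTestGate {
  // Sketch: returns true only if libfaiss_c.so (or the platform
  // equivalent) resolves via java.library.path / LD_LIBRARY_PATH.
  static boolean faissAvailable() {
    try {
      System.loadLibrary("faiss_c");
      return true;
    } catch (UnsatisfiedLinkError e) {
      return false;
    }
  }

  @Before
  public void requireNativeFaiss() {
    Assume.assumeTrue("libfaiss_c not found, skipping", faissAvailable());
  }
}
```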
msokolov commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2698449439
I guess this 94GB comes from 33M*768*4 bytes? Frankly I never test with
indexes > ~2M docs, but maybe there is a call for the 33M-doc index in
nightlies?
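(For reference, the arithmetic: 33,000,000 docs × 768 dims × 4 bytes/float = 101,376,000,000 bytes ≈ 94.4 GiB, consistent with the 94 GB figure.)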
benwtrent commented on PR #14078:
URL: https://github.com/apache/lucene/pull/14078#issuecomment-2698684914
@lpld here is my Lucene util changes:
https://github.com/mikemccand/luceneutil/pull/348
> What exactly do the numbers in the description of this pull request mean?
When you say
benwtrent commented on issue #14214:
URL: https://github.com/apache/lucene/issues/14214#issuecomment-2698475184
The goal of connectComponents is to help graphs that have gaps in their
connectivity. However, when it's needed most (e.g. tons of gaps and poor
connectivity), it does more harm th
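As a conceptual illustration only (this is not Lucene's HnswGraphBuilder code), the connectivity problem connectComponents targets amounts to counting components of the neighbor graph: if there is more than one, some vectors are unreachable from the entry point.
```java
import java.util.ArrayDeque;
import java.util.Deque;

public class GraphConnectivity {
  // Count connected components of a graph given as adjacency lists,
  // via iterative depth-first traversal.
  static int countComponents(int[][] neighbors) {
    boolean[] seen = new boolean[neighbors.length];
    int components = 0;
    for (int start = 0; start < neighbors.length; start++) {
      if (seen[start]) continue;
      components++; // found a node no prior traversal reached
      Deque<Integer> stack = new ArrayDeque<>();
      stack.push(start);
      seen[start] = true;
      while (!stack.isEmpty()) {
        for (int neighbor : neighbors[stack.pop()]) {
          if (!seen[neighbor]) {
            seen[neighbor] = true;
            stack.push(neighbor);
          }
        }
      }
    }
    return components;
  }
}
```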