Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]
mikemccand commented on issue #13647: URL: https://github.com/apache/lucene/issues/13647#issuecomment-2701096339

> I guess this 94GB comes from `33M x 768 x 4` bytes? Frankly I never test with indexes > ~2M docs, but maybe there is a call for the 33M-doc index in nightlies?

Yeah ... nightly benchy builds a ~33M vectors index. I think it's helpful to confirm we can index/search decent sized KNN indices ... Maybe we just leave the 94 GB Cohere vectors source on my s3 bucket for now? We can reassess if it becomes a problem ... `luceneutil` `initial_setup.py` should already be downloading from there...

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
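As a quick check of the `33M x 768 x 4` estimate quoted above, here is a small sketch of the arithmetic. It assumes exactly 33,000,000 vectors (the thread only says "~33M") and 4-byte float32 components, so the result is approximate.

```python
# Back-of-the-envelope size of the raw vector source discussed above.
# Assumptions (not stated in the thread): exactly 33,000,000 vectors,
# each component stored as a 4-byte float32.
NUM_VECTORS = 33_000_000
DIMENSIONS = 768
BYTES_PER_FLOAT32 = 4

total_bytes = NUM_VECTORS * DIMENSIONS * BYTES_PER_FLOAT32
size_gb = total_bytes / 10**9    # decimal gigabytes
size_gib = total_bytes / 2**30   # binary gibibytes

print(f"{total_bytes:,} bytes = {size_gb:.1f} GB = {size_gib:.1f} GiB")
```

Under these assumptions the quoted "94 GB" corresponds to the binary (GiB) figure; in decimal units the same file is about 101 GB.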
Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]
mikemccand commented on issue #13647: URL: https://github.com/apache/lucene/issues/13647#issuecomment-2701189251 @dweiss I can't tell from above -- are there other corpora that need a home still?
Re: [PR] introduce new parameter onlyLongestMatchNoSubwords replacing onlyLongestMatch [lucene]
renatoh commented on PR #14311: URL: https://github.com/apache/lucene/pull/14311#issuecomment-2701699595

> @renatoh that seems fine, if you have a way to do it so it works. Because they are just booleans I wasn't sure?
>
> I also can't remember if there is a way that you can signal a deprecation from parameter specified in the TokenFilterFactory (even if it's just a hook method for e.g. solr, elasticsearch, etc to use). Maybe @uschindler knows.

The constructor taking both `onlyLongestMatch` and `onlyLongestMatchIgnoreSubwords` is now deprecated. Please have a look and let me know if this makes sense to you.
Re: [PR] introduce new parameter onlyLongestMatchNoSubwords replacing onlyLongestMatch [lucene]
rmuir commented on PR #14311: URL: https://github.com/apache/lucene/pull/14311#issuecomment-2701555710 @renatoh that seems fine, if you have a way to do it so it works. Because they are just booleans I wasn't sure? I also can't remember if there is a way that you can signal a deprecation from parameter specified in the TokenFilterFactory (even if it's just a hook method for e.g. solr, elasticsearch, etc to use). Maybe @uschindler knows.
Re: [PR] Speedup merging of HNSW graphs [lucene]
mayya-sharipova commented on PR #14331: URL: https://github.com/apache/lucene/pull/14331#issuecomment-2701756507

Evaluation is done with Luceneutil on these datasets, rebased against Lucene main branch:

1. **quora-E5-small**; 522931 docs; 384 dims; 7 bits quantized; cosine metric
   - baseline: index time: **112.41s**, force merge: **113.81s**
   - candidate: index time: **77.17s**, force merge: **56.01s**
2. **cohere-wikipedia-v2**; 1M docs; 768 dims; 7 bits quantized; cosine metric
   - baseline: index time: **158.1s**, force merge: **425.20s**
   - candidate: index time: **113.68s**, force merge: **201.52s**
3. **gist**; 1M docs; 960 dims; 7 bits quantized; euclidean metric
   - baseline: index time: **141.82s**, force merge: **536.07s**
   - candidate: index time: **108.77s**, force merge: **279.05s**
4. **cohere-wikipedia-v3**; 1M docs; 1024 dims; 7 bits quantized; dot_product metric
   - baseline: index time: **211.86s**, force merge: **654.97s**
   - candidate: index time: **161.51s**, force merge: **320.54s**
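To make the reported timings easier to compare, the following sketch computes the baseline/candidate ratio for each dataset and phase. The numbers are copied from the comment above; the `results` layout and the `speedup` helper are only illustrative names, not part of Luceneutil.

```python
# (baseline_seconds, candidate_seconds) pairs copied from the comment above.
# The dict layout and helper name are illustrative, not a Luceneutil API.
results = {
    "quora-E5-small":      {"index": (112.41, 77.17),  "force_merge": (113.81, 56.01)},
    "cohere-wikipedia-v2": {"index": (158.1, 113.68),  "force_merge": (425.20, 201.52)},
    "gist":                {"index": (141.82, 108.77), "force_merge": (536.07, 279.05)},
    "cohere-wikipedia-v3": {"index": (211.86, 161.51), "force_merge": (654.97, 320.54)},
}


def speedup(baseline, candidate):
    """Return how many times faster the candidate is than the baseline."""
    return baseline / candidate


speedups = {
    name: {phase: speedup(*times) for phase, times in phases.items()}
    for name, phases in results.items()
}

for name, phases in speedups.items():
    print(f"{name}: index {phases['index']:.2f}x, force merge {phases['force_merge']:.2f}x")
```

On these four runs the candidate indexes roughly 1.3x to 1.5x faster and force-merges roughly 1.9x to 2.1x faster than the baseline.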
[PR] Speedup merging of HNSW graphs [lucene]
mayya-sharipova opened a new pull request, #14331: URL: https://github.com/apache/lucene/pull/14331

Currently, when merging HNSW graphs incrementally, we first initialize a graph from the biggest segment. For the other segments, we rebuild the graphs completely: we go through each segment's vector values one by one and search for each vector in the new graph to find the best neighbours to connect it with.

This PR proposes a more efficient merge based on the idea that if we know where we want to insert a node, we have a good idea of where we want to insert its neighbours. As in the current approach, we initialize the new graph from the graph of the biggest segment. For every other segment, we find a smaller set of nodes that "covers" its graph (the join set `j`), and we insert that set as usual. For the other nodes, outside the join set `j`, we do lighter searches with calculated `eps`. This allows substantial speedups.

The algorithm is based on the following steps:

1. Get all graphs that don't have deletions and sort them by size (descending).
2. Copy the largest graph to the new graph (`gL`).
3. For each remaining small graph (`gS`):
   - Find the nodes that best cover `gS` (join set `j`). These nodes will be inserted into `gL` as usual: by searching `gL` to find the best candidates (`w`) to which to connect the nodes.
   - For each remaining node in `gS`:
     - We provide `eps` to search in `gL`. We form `eps` by the union of the node's neighbors in `gS` and the node's neighbors' neighbors in `gL`. We also limit `beamWidth` (`efConstruction`) to `M * 2`.

Algorithm designed by Thomas Veasey
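The steps in the PR description can be sketched on a toy single-layer graph as follows. This is not Lucene's implementation: the greedy cover used for the join set, the unpruned neighbour lists, the shared node ids across both graphs, and all function names (`greedy_cover`, `search`, `insert`, `merge`) are assumptions made only for illustration.

```python
import heapq
import math


def greedy_cover(graph):
    """Greedy dominating set; an assumed stand-in for the PR's join set j."""
    uncovered = set(graph)
    join = []
    while uncovered:
        # Pick the node that covers the most still-uncovered nodes.
        best = max(sorted(graph), key=lambda n: len(({n} | graph[n]) & uncovered))
        join.append(best)
        uncovered -= {best} | graph[best]
    return join


def search(vectors, graph, query, entry_points, beam_width):
    """Best-first search from entry_points, keeping at most beam_width results."""
    visited = set(entry_points)
    candidates = [(math.dist(vectors[e], query), e) for e in entry_points]
    heapq.heapify(candidates)
    results = sorted(candidates)[:beam_width]
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= beam_width and d > results[-1][0]:
            break  # the closest open candidate cannot improve the results
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(vectors[nb], query)
            if len(results) < beam_width or dn < results[-1][0]:
                heapq.heappush(candidates, (dn, nb))
                results = sorted(results + [(dn, nb)])[:beam_width]
    return [n for _, n in results]


def insert(vectors, graph, node, entry_points, beam_width, m):
    """Connect node to its m nearest found neighbours (no pruning, for brevity)."""
    found = search(vectors, graph, vectors[node], entry_points, beam_width)
    graph[node] = set(found[:m])
    for nb in found[:m]:
        graph[nb].add(node)


def merge(vectors, g_large, g_small, m=2, beam_width=10):
    merged = {n: set(nbrs) for n, nbrs in g_large.items()}  # step 2: copy gL
    entry = next(iter(g_large))  # assumed fixed entry point
    for node in greedy_cover(g_small):  # join set j: full-width insertion
        insert(vectors, merged, node, [entry], beam_width, m)
    for node in g_small:  # remaining nodes: light search seeded with eps
        if node in merged:
            continue
        eps = set()
        for nb in g_small[node]:
            if nb in merged:
                eps |= {nb} | merged[nb]
        insert(vectors, merged, node, sorted(eps) or [entry], min(beam_width, 2 * m), m)
    return merged


# Toy data: six points on a line, split into two undirected graphs.
vectors = {i: (float(i), 0.0) for i in range(6)}
g_large = {0: {1}, 1: {0, 2}, 2: {1}}
g_small = {3: {4}, 4: {3, 5}, 5: {4}}
merged = merge(vectors, g_large, g_small)
```

In this toy run the join set for `g_small` is just node 4, so only that node pays for a full search from the entry point; nodes 3 and 5 start their searches from node 4 and its neighbours in the merged graph, with the beam limited to `2 * m`.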
Re: [PR] introduce new parameter onlyLongestMatchNoSubwords replacing onlyLongestMatch [lucene]
uschindler commented on PR #14311: URL: https://github.com/apache/lucene/pull/14311#issuecomment-2701841157

> @renatoh that seems fine, if you have a way to do it so it works. Because they are just booleans I wasn't sure?
>
> I also can't remember if there is a way that you can signal a deprecation from parameter specified in the TokenFilterFactory (even if it's just a hook method for e.g. solr, elasticsearch, etc to use). Maybe @uschindler knows.

We have no mechanism for deprecating those map keys for the factory. We can only log a warning, but have no standard way for this yet. Solr can only detect deprecated factories (it checks for the `@Deprecated` annotation and logs a warning).
Re: [PR] introduce new parameter onlyLongestMatchNoSubwords replacing onlyLongestMatch [lucene]
renatoh commented on PR #14311: URL: https://github.com/apache/lucene/pull/14311#issuecomment-2701225973 @rmuir Sorry for rushing, but have you seen my suggestion regarding deprecating the constructor?
Re: [PR] Binary vector format for flat and hnsw vectors [lucene]
lpld commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2702238939 @benwtrent Thanks for your response, it was quite helpful. Could you please also share the other parameters of your benchmark (ndoc, maxConn, beamWidthIndex, fanout, etc.)? I was finally able to come close to your results with ndoc=500_000, maxConn=64, beamWidthIndex=250, oversample=5f. Also, a short question: what is the purpose of having both `oversample` and `fanout` parameters? At first glance it seems like they are doing almost the same thing. Appreciate your help and time!
Re: [PR] Speedup merging of HNSW graphs [lucene]
msokolov commented on PR #14331: URL: https://github.com/apache/lucene/pull/14331#issuecomment-2701861504 Oh, this is a neat idea! Looks like we sacrifice some query performance (in some cases) for a big improvement in indexing time. I wonder if we've tried other values of `beamWidth` to see if we can recover the query performance?
[I] Add a workflow generating gh stats for board summary reports [lucene]
dweiss opened a new issue, #14332: URL: https://github.com/apache/lucene/issues/14332 ### Description Apache's reporting utility is currently broken. I wrote a small gh workflow to generate the stats on demand.
Re: [I] Add a workflow generating gh stats for board summary reports [lucene]
dweiss commented on issue #14332: URL: https://github.com/apache/lucene/issues/14332#issuecomment-2701997460 Oops, something is not working with gh cli. Looking.
Re: [I] Add a workflow generating gh stats for board summary reports [lucene]
dweiss commented on issue #14332: URL: https://github.com/apache/lucene/issues/14332#issuecomment-2701993756 https://github.com/apache/lucene/actions/workflows/activity-report.yml You have to run it manually, providing the time window. The workflow's result contains the summary data for the input period (https://github.com/apache/lucene/actions/runs/13684908942).
Re: [I] Add a workflow generating gh stats for board summary reports [lucene]
dweiss closed issue #14332: Add a workflow generating gh stats for board summary reports URL: https://github.com/apache/lucene/issues/14332
Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]
dweiss commented on issue #13647: URL: https://github.com/apache/lucene/issues/13647#issuecomment-2702007572 I've downloaded and moved all those data sets that were present in gradle build files (specifically, in external-datasets.gradle). If there is anything else I should place there, let me know. I agree those humongous files can probably stay on your AWS/benchmark machine since not many people on Earth have the processing power to deal with it... Certainly Robert's laptop isn't on the list of eligible hardware. :)
Re: [I] Add a workflow generating gh stats for board summary reports [lucene]
dweiss commented on issue #14332: URL: https://github.com/apache/lucene/issues/14332#issuecomment-2702057560 Ok, working now.
Re: [I] develocity build scans fail to upload sometimes [lucene]
dweiss commented on issue #14305: URL: https://github.com/apache/lucene/issues/14305#issuecomment-2702086799

Still not working - https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/lastBuild/console

```
The Develocity server (develocity.apache.org) rejected the access key with prefix 'eyJraWQiOiJhdXRoLWtleS1pZC1lOWE1NDI0MC0wN2ZhLTRjNDQtOGUzNC0zNzIwMDU5NWM3NzgiLCJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc19hbm9ueW1vdXMiOmZhbHNlLCJwZXJtaXNzaW9ucyI6WyJWSUVXX1NDQU4iLCJFWFBPUlRfREFUQSIsIlBVQkxJU0hfU0NBTiIsIlRFU1RfRElTVFJJQlVUSU9OIiwiUFJFRElDVElWRV9URVNUX1NFTEVDVElPTiIsIlJFQURfQ0FDSEUiLCJBQ0NFU1NfQUxMX0RBVEEiLCJBQ0NFU1NfREFUQV9XSVRIT1VUX0FTU09DSUFURURfUFJPSkVDVCIsIk1ZX1NFVFRJTkdTIiwiUkVBRF9WRVJTSU9OIl0sImFsbERhdGFQZXJtaXNzaW9ucyI6WyJWSUVXX1NDQU4iLCJQUkVESUNUSVZFX1RFU1RfU0VMRUNUSU9OIiwiUkVBRF9DQUNIRSIsIkFDQ0VTU19EQVRBX1dJVEhPVVRfQVNTT0NJQVRFRF9QUk9KRUNUIiwiRVhQT1JUX0RBVEEiLCJQVUJMSVNIX1NDQU4iLCJURVNUX0RJU1RSSUJVVElPTiIsIkFDQ0VTU19BTExfREFUQSIsIk1ZX1NFVFRJTkdTIiwiUkVBRF9WRVJTSU'. Your access key has expired - please visit https://develocity.apache.org/settings/access-keys to refresh it.
```

Not sure what's happening here.
Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]
github-actions[bot] commented on PR #13914: URL: https://github.com/apache/lucene/pull/13914#issuecomment-2702387083 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!