Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]

2025-03-05 Thread via GitHub


mikemccand commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2701096339

   > I guess this 94GB comes from `33M x 768 x 4` bytes? Frankly I never test 
with indexes > ~2M docs, but maybe there is a call for the 33M-doc index in 
nightlies?
   
   Yeah ... nightly benchy builds a ~33M vectors index.  I think it's helpful 
to confirm we can index/search decent sized KNN indices ...
   
   Maybe we just leave the 94 GB Cohere vectors source on my s3 bucket for now? 
 We can reassess if it becomes a problem ... `luceneutil` `initial_setup.py` 
should already be downloading from there...
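
As a quick sanity check of the size estimate quoted above, here is a minimal sketch that multiplies the vector count, the dimension, and the bytes per float32 component. The count of exactly 33,000,000 vectors is an assumption (the thread only says ~33M), and the function name is hypothetical.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_component: int = 4) -> int:
    """Size of a flat float32 vector file, ignoring any header or metadata."""
    return num_vectors * dims * bytes_per_component


# ASSUMPTION: exactly 33 million vectors; the thread only gives "~33M".
size = raw_vector_bytes(33_000_000, 768)
print(f"{size / 10**9:.1f} GB (decimal units)")
print(f"{size / 2**30:.1f} GiB (binary units)")
```

Under these assumptions the raw data is about 101.4 GB in decimal units, or about 94.4 GiB, so the quoted 94 GB figure is consistent with `33M x 768 x 4` bytes if it was reported in binary units.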


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]

2025-03-05 Thread via GitHub


mikemccand commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2701189251

   @dweiss I can't tell from above -- are there other corpora that need a home 
still?





Re: [PR] introduce new parameter onlyLongestMatchNoSubwords replacing onlyLongestMatch [lucene]

2025-03-05 Thread via GitHub


renatoh commented on PR #14311:
URL: https://github.com/apache/lucene/pull/14311#issuecomment-2701699595

   > @renatoh that seems fine, if you have a way to do it so it works. Because 
they are just booleans I wasn't sure?
   > 
   > I also can't remember if there is a way that you can signal a deprecation 
from a parameter specified in the TokenFilterFactory (even if it's just a hook 
method for e.g. solr, elasticsearch, etc to use). Maybe @uschindler knows.
   
   The constructor with both onlyLongestMatch and 
onlyLongestMatchIgnoreSubwords is now deprecated. Please have a look and see if 
this makes sense to you.





Re: [PR] introduce new parameter onlyLongestMatchNoSubwords replacing onlyLongestMatch [lucene]

2025-03-05 Thread via GitHub


rmuir commented on PR #14311:
URL: https://github.com/apache/lucene/pull/14311#issuecomment-2701555710

   @renatoh that seems fine, if you have a way to do it so it works. Because 
they are just booleans I wasn't sure?
   
   I also can't remember if there is a way that you can signal a deprecation 
from a parameter specified in the TokenFilterFactory (even if it's just a hook 
method for e.g. solr, elasticsearch, etc to use). Maybe @uschindler knows.





Re: [PR] Speedup merging of HNSW graphs [lucene]

2025-03-05 Thread via GitHub


mayya-sharipova commented on PR #14331:
URL: https://github.com/apache/lucene/pull/14331#issuecomment-2701756507

   Evaluation was done with Luceneutil on the datasets below (rebased against 
the Lucene main branch):
   
   
   1. **quora-E5-small**; 522931 docs; 384 dims; 7 bits quantized; cosine metric
   
  - baseline: index time: **112.41s**,  force merge: **113.81s**
   
  - candidate: index time: **77.17s**, force merge: **56.01s**
   
   2. **cohere-wikipedia-v2**; 1M docs; 768 dims; 7 bits quantized; cosine 
metric
   
  - baseline: index time: **158.1s**, force merge: **425.20s**
   
  - candidate: index time: **113.68s**, force merge: **201.52s**
   
   3. **gist**; 960 dims, 1M docs; 7 bits quantized; euclidean metric
   
  - baseline: index time: **141.82s**, force merge: **536.07s**
   
  - candidate: index time: **108.77s**, force merge: **279.05s**
   
   4. **cohere-wikipedia-v3**; 1M docs; 1024 dims; 7 bits quantized; 
dot_product metric
   
  - baseline: index time: **211.86s**, force merge: **654.97s**
   
  - candidate: index time: **161.51s**, force merge: **320.54s**
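
To make the comparison easier to read, the following sketch computes the speedup (baseline time divided by candidate time) for each dataset. The timings are copied from the results above; the dictionary layout and the `speedup` helper are illustrative names, not part of Luceneutil.

```python
# (baseline_seconds, candidate_seconds) pairs copied from the results above.
results = {
    "quora-E5-small": {"index": (112.41, 77.17), "force_merge": (113.81, 56.01)},
    "cohere-wikipedia-v2": {"index": (158.1, 113.68), "force_merge": (425.20, 201.52)},
    "gist": {"index": (141.82, 108.77), "force_merge": (536.07, 279.05)},
    "cohere-wikipedia-v3": {"index": (211.86, 161.51), "force_merge": (654.97, 320.54)},
}


def speedup(baseline: float, candidate: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


speedups = {
    name: {phase: speedup(*times) for phase, times in phases.items()}
    for name, phases in results.items()
}

for name, phases in speedups.items():
    print(f"{name}: index {phases['index']:.2f}x, force merge {phases['force_merge']:.2f}x")
```

On these numbers, index time improves by roughly 1.3x to 1.5x and force merge by roughly 1.9x to 2.1x.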





[PR] Speedup merging of HNSW graphs [lucene]

2025-03-05 Thread via GitHub


mayya-sharipova opened a new pull request, #14331:
URL: https://github.com/apache/lucene/pull/14331

   Currently, when merging HNSW graphs incrementally, we first initialize a 
   graph from the biggest segment, and for the other segments we rebuild the 
   graphs completely by going through a segment's vector values one by one, 
   searching for each in the new graph to find the best neighbours to connect 
   with.
   
   This PR proposes more efficient merging based on the idea that if we know 
   where we want to insert a node, we have a good idea of where we want to 
   insert its neighbours.
   Similarly to the current approach, we initialize a new graph from the graph 
   of the biggest segment. For all other segments, we find a smaller set of 
   nodes that "covers" their graph (the join set `j`), and we insert that set 
   as usual. For the other nodes, outside of the join set, we do lighter 
   searches with calculated entry points (`eps`).
   
   This allows substantial speedups.
   
   
   The algorithm is based on the following steps:
   
   1. Get all graphs that don't have deletions and sort them by size 
      (descending).
   2. Copy the largest graph to the new graph (`gL`).
   3. For each remaining small graph (`gS`):
      - Find the nodes that best cover `gS` (join set `j`). These nodes are 
        inserted into `gL` as usual: by searching `gL` to find the best 
        candidates (`w`) to which to connect the nodes.
      - For each remaining node in `gS`:
        - We provide `eps` to the search in `gL`. We form `eps` from the union 
          of the node's neighbors in `gS` and the node's neighbors' neighbors 
          in `gL`. We also limit `beamWidth` (`efConstruction`) to `M * 2`.
   
   
   
   
   Algorithm designed by Thomas Veasey
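
The two per-graph steps described above can be illustrated with a toy sketch. This is not the PR's implementation: graphs are modelled as single-layer dictionaries from node id to neighbour ids, the join set is chosen with a simple greedy covering heuristic (an assumption, since the description does not say how the cover is computed), and both graphs are assumed to share one id space. All function names are hypothetical.

```python
from typing import Dict, Set

Graph = Dict[int, Set[int]]  # node id -> neighbour ids (single layer, toy model)


def greedy_join_set(g_small: Graph) -> Set[int]:
    """Pick nodes until every node is in the set or adjacent to a member.

    ASSUMPTION: a greedy covering heuristic stands in for the PR's actual
    selection rule, which the description does not spell out.
    """
    uncovered = set(g_small)
    join: Set[int] = set()
    while uncovered:
        # Choose the node that covers the most still-uncovered nodes
        # (ties go to the smallest id because candidates are sorted).
        best = max(sorted(g_small), key=lambda n: len(({n} | g_small[n]) & uncovered))
        join.add(best)
        uncovered -= {best} | g_small[best]
    return join


def entry_points(node: int, g_small: Graph, g_large: Graph) -> Set[int]:
    """Union of the node's gS neighbours and those neighbours' gL neighbours.

    ASSUMPTION: only neighbours already inserted into g_large are usable.
    """
    inserted = {n for n in g_small[node] if n in g_large}
    eps = set(inserted)
    for n in inserted:
        eps |= g_large[n]
    eps.discard(node)
    return eps


# Small graph gS with four nodes.
g_small: Graph = {10: {11, 12}, 11: {10}, 12: {10, 13}, 13: {12}}
join = greedy_join_set(g_small)

# Large graph gL after the join-set nodes (10 and 12) were inserted as usual.
g_large: Graph = {0: {1, 10}, 1: {0, 12}, 10: {0, 12}, 12: {1, 10}}
eps_for_13 = entry_points(13, g_small, g_large)
print(sorted(join), sorted(eps_for_13))
```

In this toy example the join set is `{10, 12}`, and node 13 would start its lighter search in `gL` from `{1, 10, 12}` rather than from the graph's global entry point.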





Re: [PR] introduce new parameter onlyLongestMatchNoSubwords replacing onlyLongestMatch [lucene]

2025-03-05 Thread via GitHub


uschindler commented on PR #14311:
URL: https://github.com/apache/lucene/pull/14311#issuecomment-2701841157

   > @renatoh that seems fine, if you have a way to do it so it works. Because 
they are just booleans I wasn't sure?
   > 
   > I also can't remember if there is a way that you can signal a deprecation 
from a parameter specified in the TokenFilterFactory (even if it's just a hook 
method for e.g. solr, elasticsearch, etc to use). Maybe @uschindler knows.
   
   We have no mechanism for deprecating those map keys for the factory.
   
   We can only log a warning, but have no standard way for this yet.
   
   Solr can only detect deprecated factories (it checks for the `@Deprecated` 
annotation and logs a warning).
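
As a language-neutral illustration of the "log a warning" option mentioned above, here is a minimal sketch of checking a factory's argument map for deprecated keys. It is not Lucene or Solr code: the `DEPRECATED_KEYS` table and the `warn_on_deprecated_keys` helper are hypothetical, and the key names are taken from the PR title only as an example.

```python
import warnings
from typing import Dict

# Hypothetical table of deprecated argument keys and their replacements.
DEPRECATED_KEYS = {"onlyLongestMatch": "onlyLongestMatchNoSubwords"}


def warn_on_deprecated_keys(args: Dict[str, str]) -> None:
    """Emit one warning per deprecated key found in a factory's argument map."""
    for key in sorted(args):
        if key in DEPRECATED_KEYS:
            warnings.warn(
                f"Factory argument '{key}' is deprecated; "
                f"use '{DEPRECATED_KEYS[key]}' instead.",
                DeprecationWarning,
                stacklevel=2,
            )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_on_deprecated_keys({"onlyLongestMatch": "true", "minWordSize": "5"})

print(len(caught), "warning(s) recorded")
```

The point of the sketch is only that a per-key check has to live somewhere shared if the warning is to be standard across factories, which is the mechanism the comment says does not exist yet.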





Re: [PR] introduce new parameter onlyLongestMatchNoSubwords replacing onlyLongestMatch [lucene]

2025-03-05 Thread via GitHub


renatoh commented on PR #14311:
URL: https://github.com/apache/lucene/pull/14311#issuecomment-2701225973

   @rmuir Sorry for rushing, but have you seen my suggestion regarding 
deprecating the constructor?





Re: [PR] Binary vector format for flat and hnsw vectors [lucene]

2025-03-05 Thread via GitHub


lpld commented on PR #14078:
URL: https://github.com/apache/lucene/pull/14078#issuecomment-2702238939

   @benwtrent Thanks for your response, it was quite helpful. Could you please 
also share the other parameters of your benchmark (ndoc, maxConn, 
beamWidthIndex, fanout, etc.)?
   
   I was finally able to come close to your results with ndoc=500_000, 
maxConn=64, beamWidthIndex=250, oversample=5f.
   
   Also, a short question. What is the purpose of having both `oversample` and 
`fanout` parameters? At first glance it seems like they are doing almost the 
same thing.
   
   Appreciate your help and time!





Re: [PR] Speedup merging of HNSW graphs [lucene]

2025-03-05 Thread via GitHub


msokolov commented on PR #14331:
URL: https://github.com/apache/lucene/pull/14331#issuecomment-2701861504

   oh, this is a neat idea! Looks like we sacrifice some query performance (in 
some cases) for a big improvement in indexing time. I wonder if we've tried 
other values of `beamWidth` to see if we can recover the query performance?





[I] Add a workflow generating gh stats for board summary reports [lucene]

2025-03-05 Thread via GitHub


dweiss opened a new issue, #14332:
URL: https://github.com/apache/lucene/issues/14332

   ### Description
   
   Apache's reporting utility is currently broken. I wrote a small gh workflow 
to generate the stats on demand.





Re: [I] Add a workflow generating gh stats for board summary reports [lucene]

2025-03-05 Thread via GitHub


dweiss commented on issue #14332:
URL: https://github.com/apache/lucene/issues/14332#issuecomment-2701997460

   Oops, something is not working with gh cli. Looking.





Re: [I] Add a workflow generating gh stats for board summary reports [lucene]

2025-03-05 Thread via GitHub


dweiss commented on issue #14332:
URL: https://github.com/apache/lucene/issues/14332#issuecomment-2701993756

   https://github.com/apache/lucene/actions/workflows/activity-report.yml
   
   You have to run it manually, providing the time window:
   
   
![Image](https://github.com/user-attachments/assets/910d7a1b-cd2e-47c7-9593-6371acfee660)
   
   The workflow's result contains the summary data for the input period 
(https://github.com/apache/lucene/actions/runs/13684908942):
   
   
![Image](https://github.com/user-attachments/assets/fd0c9a5a-63ba-414c-a3ae-652ddcca0467)
   
   





Re: [I] Add a workflow generating gh stats for board summary reports [lucene]

2025-03-05 Thread via GitHub


dweiss closed issue #14332: Add a workflow generating gh stats for board 
summary reports
URL: https://github.com/apache/lucene/issues/14332





Re: [I] remove refs to people.apache.org/home.apache.org in build [lucene]

2025-03-05 Thread via GitHub


dweiss commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2702007572

   I've downloaded and moved all the data sets that were present in the gradle 
build files (specifically, in external-datasets.gradle). If there is anything 
else I should place there, let me know. I agree those humongous files can 
probably stay on your AWS/benchmark machine since not many people on Earth have 
the processing power to deal with them... Certainly Robert's laptop isn't on 
the list of eligible hardware. :)





Re: [I] Add a workflow generating gh stats for board summary reports [lucene]

2025-03-05 Thread via GitHub


dweiss commented on issue #14332:
URL: https://github.com/apache/lucene/issues/14332#issuecomment-2702057560

   Ok, working now.
   
   
![Image](https://github.com/user-attachments/assets/3ab11c13-69c2-404d-85d2-5c604da545fe)





Re: [I] develocity build scans fail to upload sometimes [lucene]

2025-03-05 Thread via GitHub


dweiss commented on issue #14305:
URL: https://github.com/apache/lucene/issues/14305#issuecomment-2702086799

   Still not working -
   
https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/lastBuild/console
   
   ```
   The Develocity server (develocity.apache.org) rejected the access key with 
prefix 
'eyJraWQiOiJhdXRoLWtleS1pZC1lOWE1NDI0MC0wN2ZhLTRjNDQtOGUzNC0zNzIwMDU5NWM3NzgiLCJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc19hbm9ueW1vdXMiOmZhbHNlLCJwZXJtaXNzaW9ucyI6WyJWSUVXX1NDQU4iLCJFWFBPUlRfREFUQSIsIlBVQkxJU0hfU0NBTiIsIlRFU1RfRElTVFJJQlVUSU9OIiwiUFJFRElDVElWRV9URVNUX1NFTEVDVElPTiIsIlJFQURfQ0FDSEUiLCJBQ0NFU1NfQUxMX0RBVEEiLCJBQ0NFU1NfREFUQV9XSVRIT1VUX0FTU09DSUFURURfUFJPSkVDVCIsIk1ZX1NFVFRJTkdTIiwiUkVBRF9WRVJTSU9OIl0sImFsbERhdGFQZXJtaXNzaW9ucyI6WyJWSUVXX1NDQU4iLCJQUkVESUNUSVZFX1RFU1RfU0VMRUNUSU9OIiwiUkVBRF9DQUNIRSIsIkFDQ0VTU19EQVRBX1dJVEhPVVRfQVNTT0NJQVRFRF9QUk9KRUNUIiwiRVhQT1JUX0RBVEEiLCJQVUJMSVNIX1NDQU4iLCJURVNUX0RJU1RSSUJVVElPTiIsIkFDQ0VTU19BTExfREFUQSIsIk1ZX1NFVFRJTkdTIiwiUkVBRF9WRVJTSU'.
   Your access key has expired - please visit 
https://develocity.apache.org/settings/access-keys to refresh it.
   ```
   
   Not sure what's happening here.





Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]

2025-03-05 Thread via GitHub


github-actions[bot] commented on PR #13914:
URL: https://github.com/apache/lucene/pull/13914#issuecomment-2702387083

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!

