Re: [I] HnwsGraph creates disconnected components [lucene]

via GitHub Thu, 16 Nov 2023 06:32:47 -0800


nitirajrathore commented on issue #12627:
URL: https://github.com/apache/lucene/issues/12627#issuecomment-1814530858


   > meaning that we can get same recall for a smaller max-conn value now.
   
   I ran some tests with max-conn=16 and max-conn=8, and it seems that 
with [my proposal](https://github.com/apache/lucene/pull/12783/commits), even 
max-conn=8 does better than max-conn=16 on mainline. I will also add 
more stats. 
   
   I diverged from the `main` branch at commit 
`f64bb19697708bfd91e05ff4314976c991f60cbc` (15 Oct) and haven't merged back 
since. I think this is `Lucene > 9.7.0`, but I will confirm. 
   
   --- 
   `main`: commit ID f64bb19697708bfd91e05ff4314976c991f60cbc with max-conn = 16
   
   |recall |avgCpuTime |numDocs |fanout |maxConn |beamWidth |totalVisited |reindexTimeMsec |selectivity |prefilter|
   |---|---|---|---|---|---|---|---|---|---|
   |0.451 |17.22 |1000000 |0 |16 |100 |10 |406166 |1.00 |post-filter|
   
   
   candidate with max-conn = 16
   
   |recall |avgCpuTime |numDocs |fanout |maxConn |beamWidth |totalVisited |reindexTimeMsec |selectivity |prefilter|
   |---|---|---|---|---|---|---|---|---|---|
   |0.595 (+32%) |24.19 (+40%) |1000000 |0 |16 |100 |10 |581090 (+43%) |1.00 |post-filter|
   
   
   candidate with max-conn = 8
   
   |recall |avgCpuTime |numDocs |fanout |maxConn |beamWidth |totalVisited |reindexTimeMsec |selectivity |prefilter|
   |---|---|---|---|---|---|---|---|---|---|
   |0.465 (+3%) |16.35 (-5%) |1000000 |0 |8 |100 |10 |325321 (-20%) |1.00 |post-filter|
   
   ---
   
   Interesting fact: a simple implementation that uses two nested for loops 
to find the common neighbours works better than one using `HashSet<Integer>` 
or `IntIntHashMap`. I think that is because the main contributor to indexing 
time is the increased number of connections. 
   But as shown above, indexing time drops drastically when max-conn is 
decreased, while recall and search avgCpuTime are maintained or slightly 
improved.
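
To make the comparison above concrete, here is a minimal sketch of the two ways of counting common neighbours. It is not code from the PR: the class and method names are hypothetical, and it assumes each neighbour list is a small `int[]` of distinct node ids (bounded by max-conn).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not taken from the PR: counts node ids that
// appear in both neighbour lists. Assumes ids are distinct within each list.
final class CommonNeighbours {
  private CommonNeighbours() {}

  // Nested-loop version: O(|a| * |b|) comparisons, no allocation, no boxing.
  static int countWithLoops(int[] a, int[] b) {
    int common = 0;
    for (int x : a) {
      for (int y : b) {
        if (x == y) {
          common++;
          break; // ids are distinct within b, so stop at the first match
        }
      }
    }
    return common;
  }

  // HashSet version: O(|a| + |b|) expected time, but it allocates a set
  // and boxes every id into an Integer.
  static int countWithHashSet(int[] a, int[] b) {
    Set<Integer> seen = new HashSet<>();
    for (int x : a) {
      seen.add(x);
    }
    int common = 0;
    for (int y : b) {
      if (seen.contains(y)) {
        common++;
      }
    }
    return common;
  }
}
```

For lists of at most 8 or 16 entries the quadratic loop does only a few hundred primitive comparisons at most, which is one plausible reason the hash-based variants bring no benefit here.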
   
   I will follow up with more scripts, info/stats, and some code 
improvements next. 
   Also, I am thinking I should do a 1-1 comparison of the 3 heuristics 
mentioned in the paper, along with the level of disconnectedness in each approach.
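
One possible way to quantify the level of disconnectedness is to count how many nodes cannot be reached from the entry point by following neighbour links. The sketch below assumes a single graph level represented as a plain `int[][]` adjacency array; it does not use Lucene's graph API, and all names in it are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not Lucene code: measures disconnectedness of one
// graph level given as adjacency lists (neighbours[i] = out-links of node i).
final class Reachability {
  private Reachability() {}

  // Returns the number of nodes that a breadth-first traversal starting
  // at entryNode never visits.
  static int countUnreachable(int[][] neighbours, int entryNode) {
    int n = neighbours.length;
    if (n == 0) {
      return 0;
    }
    boolean[] visited = new boolean[n];
    Deque<Integer> queue = new ArrayDeque<>();
    visited[entryNode] = true;
    queue.add(entryNode);
    int reached = 1;
    while (!queue.isEmpty()) {
      int node = queue.poll();
      for (int next : neighbours[node]) {
        if (!visited[next]) {
          visited[next] = true;
          reached++;
          queue.add(next);
        }
      }
    }
    return n - reached;
  }
}
```

Running this on the graph built by each heuristic, with the same data and max-conn, would give one comparable number per approach.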


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

