kaivalnp opened a new pull request, #1059:
URL: https://github.com/apache/lucene/pull/1059

   `KnnGraphTester` has some drawbacks and needs a refactor because it:
   - Can only measure HNSW search time on graphs it created itself (it cannot
easily work with existing / custom indexes)
   - Requires some parameters (like maxConn, beamWidth, dim, etc.) even when
searching, although they are unnecessary or can be inferred from the index
   - Does not handle corner cases (where fewer than topK results can be found
because the filter is very selective)
   - Is not reproducible for selective filters (since the filter is random)
   
   Proposed changes:
   - Search by index path, including indexes not created by
`KnnGraphTester`. The original `docs` used to create the index are not needed,
as all vectors are already indexed and accessible via `getVectorValues`
   - Remove redundant arguments that aren't required or can be inferred
automatically
   - Add the ability to pass a `seed` (for reproducibility), `maxSegments` (if
one wants a single-segment index; this was enforced earlier and is now
optional), `knnField` (for searching custom indexes), and `cache` (to save
brute-force search time by reusing precomputed results)
   - Change serialization / deserialization of precomputed results to handle
corner cases (fewer than `topK` results)
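
To illustrate the last point, here is a minimal sketch of a cache layout that tolerates fewer than `topK` results per query by storing an explicit count before each result list. The class name `ResultCacheSketch` and the byte layout are assumptions made for illustration; they are not taken from the PR and may differ from what `KnnGraphTester` actually writes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical cache format: a query count, then per query a length followed by doc IDs. */
final class ResultCacheSketch {

  /** Serializes one int[] of doc IDs per query; a row may hold fewer than topK entries. */
  static byte[] write(int[][] resultsPerQuery) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(resultsPerQuery.length);
      for (int[] docIds : resultsPerQuery) {
        out.writeInt(docIds.length); // explicit length instead of assuming topK
        for (int docId : docIds) {
          out.writeInt(docId);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Reads back exactly as many doc IDs per query as were written. */
  static int[][] read(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      int[][] results = new int[in.readInt()][];
      for (int q = 0; q < results.length; q++) {
        results[q] = new int[in.readInt()];
        for (int i = 0; i < results[q].length; i++) {
          results[q][i] = in.readInt();
        }
      }
      return results;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // With topK = 3: a full row, a short row, and a row with no matches at all.
    int[][] expected = {{4, 8, 15}, {16}, {}};
    int[][] actual = read(write(expected));
    System.out.println(Arrays.deepToString(actual));
  }
}
```

A fixed-width layout (always `topK` IDs per query) would be smaller by one int per query, but it cannot represent the short rows that a very selective filter produces.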
   
   Indexing Params:
   
   ```
   Required
   -Doperation=index
   -Ddocs=(path of vec file containing docs)
   -Ddim=(dimension of doc vectors)
   -DnumDocs=(number of vectors)
   -Dindex=(index path to be created)
   -DmaxConn=(`maxConn` used for indexing)
   -DbeamWidth=(`beamWidth` used for indexing)
   
   Optional
   -DknnField=(knn field name in index; defaults to `knn`)
   -Dfunction=(similarity function to be used, `DOT_PRODUCT` | `EUCLIDEAN`; defaults to `DOT_PRODUCT`)
   -DmaxSegments=(max segments desired; defaults to no merges)
   ```
   
   Search Params:
   
   ```
   Required
   -Doperation=search
   -Dindex=(path of index; `dim` will be inferred)
   -Dqueries=(path of vec file containing queries)
   -DnumQueries=(number of queries to run)
   -DtopK=(desired `topK`)
   
   Optional
   -Dcache=(path to cache; read from cache if found, else compute and write new)
   -DknnField=(knn field name in index; defaults to `knn`)
   -Dfanout=(desired `fanout`; defaults to 0)
   -DfilterSelectivity=(selectivity of filter; defaults to 1)
   -Dseed=(seed)
   ```
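
As a rough sketch of how `filterSelectivity` and `seed` could interact, the snippet below builds a random accept-set from a fixed seed, so the same filter is produced on every run. The class and method names are hypothetical, and the per-document coin flip is an assumption about the approach, not the PR's actual code.

```java
import java.util.BitSet;
import java.util.Random;

/** Hypothetical helper: a reproducible random filter over doc IDs. */
final class SeededFilterSketch {

  /** Accepts each doc with probability {@code selectivity}, driven by a fixed seed. */
  static BitSet randomFilter(int numDocs, double selectivity, long seed) {
    Random random = new Random(seed);
    BitSet accepted = new BitSet(numDocs);
    for (int doc = 0; doc < numDocs; doc++) {
      if (random.nextDouble() < selectivity) {
        accepted.set(doc);
      }
    }
    return accepted;
  }

  /** A selective filter may leave fewer than topK docs, so cap the expected hit count. */
  static int reachableTopK(BitSet accepted, int topK) {
    return Math.min(topK, accepted.cardinality());
  }

  public static void main(String[] args) {
    BitSet first = randomFilter(1000, 0.01, 42L);
    BitSet second = randomFilter(1000, 0.01, 42L);
    System.out.println("same filter on both runs: " + first.equals(second));
    System.out.println("reachable hits for topK=100: " + reachableTopK(first, 100));
  }
}
```

Because the accept-set depends only on `numDocs`, the selectivity, and the seed, precomputed brute-force results stay valid across runs as long as those three values are unchanged.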
   
   Some considerations:
   
   - Does not extend `LuceneTestCase`: as a JUnit test, it would have limited
read access (only the `resources` folder) and could write only to temp folders.
This is not very useful when working with existing indexes / caches. A `seed`
argument was added instead for reproducibility
   - Switched to JVM arguments (system properties) for cleaner code (each
property is accessed directly, with no parsing boilerplate)
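
For readers unfamiliar with the `-D` style, here is a minimal sketch of reading a few of the search parameters listed above from system properties. The `SearchArgsSketch` class and its helpers are hypothetical; only the property names and the `knn` / `0` defaults come from this description.

```java
/** Hypothetical holder for a subset of the search parameters, read from -D properties. */
final class SearchArgsSketch {
  final String indexPath;
  final int topK;
  final int fanout;
  final String knnField;

  private SearchArgsSketch(String indexPath, int topK, int fanout, String knnField) {
    this.indexPath = indexPath;
    this.topK = topK;
    this.fanout = fanout;
    this.knnField = knnField;
  }

  /** Returns the property value or fails with a message naming the missing flag. */
  static String required(String name) {
    String value = System.getProperty(name);
    if (value == null) {
      throw new IllegalArgumentException("Missing required argument -D" + name);
    }
    return value;
  }

  static SearchArgsSketch fromSystemProperties() {
    return new SearchArgsSketch(
        required("index"),
        Integer.parseInt(required("topK")),
        Integer.getInteger("fanout", 0), // optional; defaults to 0
        System.getProperty("knnField", "knn")); // optional; defaults to "knn"
  }

  public static void main(String[] args) {
    try {
      SearchArgsSketch parsed = fromSystemProperties();
      System.out.println("index=" + parsed.indexPath + " topK=" + parsed.topK
          + " fanout=" + parsed.fanout + " knnField=" + parsed.knnField);
    } catch (IllegalArgumentException e) {
      System.err.println(e.getMessage());
    }
  }
}
```

Running it with, for example, `-Dindex=... -DtopK=10` prints the parsed values, while omitting a required property reports which flag is missing.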
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

