kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r935382898


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +794,61 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    @SuppressForbidden(reason = "Uses Math.random()")

Review Comment:
   > This seems like a nice follow-up if you're up for it! For this PR, it's no 
problem with me to just use `Math.random()`.
   
   Hey, I did some follow-up on this, and here's the sample 
[code](https://github.com/kaivalnp/lucene/tree/refactor_graph_tester)
   
   Some key points:
   - Separate indexing and search operations
   - Allow passing index path instead (we can now test HNSW search on any 
existing index, even those not created by `KnnGraphTest`. This can be 
beneficial for testing Knn search time on custom indexes)
   - Reduced redundant arguments (we needed to pass `maxConn`, `beamWidth`, 
`dim` etc even for searching, which aren't required/should be inferred from 
index automatically)
   - Ability to pass a `seed` (for reproducibility), `maxSegments` (if one 
wants a single segment index; this was enforced earlier, can be optional), 
`knnField` (while searching in custom indexes), `cache` (to save on brute force 
search time using precomputed results)
   - Considered corner cases where if `selectivity` is high, `topK` results may 
not be found (current implementation requires topK results as far as I know)
   
   Indexing Params:
   ```
   -Doperation=index
   -Ddocs=(path of vec file containing docs)
   -Ddim=(dimension of doc vectors)
   -DnumDocs=(number of vectors)
   -Dindex=(index path to be created)
   -DknnField=(knn field name in index; optional, defaults to `knn`)
   -DmaxConn=(`maxConn` used for indexing)
   -DbeamWidth=(`beamWidth` used for indexing)
   -DmaxSegments=(max segments desired; optional, defaults to no merges)
   ```
   
   Search Params:
   ```
   -Doperation=search
   -Dindex=(path of index; `dim` will be inferred)
   -DknnField=knn field name in index; optional, defaults to `knn`)
   -Dqueries=(path of vec file containing queries)
   -DnumQueries=(number of queries to run)
   -Dcache=(path to cache; read from cache if found, else compute and write new)
   -DtopK=(desired `topK`)
   -DfilterSelectivity=(selectivity of filter; optional, defaults to 1)
   -Dseed=(seed; optional)
   ```
   
   Yet to be added:
   - Index `info` operation (some segment info?)
   - pre/post filter (currently pre-filters by default, add option for 
over-selection + post-filter)
   - some output formatting (only basic details now)
   
   Some considerations:
   - Not extended `LuceneTestCase` as being a unit test, it can only access 
limited resources and write to temporary files. However, incorporated a `seed` 
argument for reproducibility
   - Shifted to JVM arguments for cleaner code (directly access property, no 
boilerplate required)
   
   Please let me know if this is in the right direction.. (or if I missed 
something)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to