benchaplin opened a new pull request, #13984:
URL: https://github.com/apache/lucene/pull/13984

   ### Description
   
   This is a first pass on #13724 - I've added the following checks:
   - verify neighbor list is in order 
   - verify no duplicate neighbors
   
   The code was adapted from [KnnGraphTester](https://github.com/mikemccand/luceneutil/blob/3a6235c2038de6c6d5d0575d8f58cf06c2836793/src/main/knn/KnnGraphTester.java#L521) in luceneutil.
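
For readers who want the shape of the two checks without opening the PR, here is a minimal stand-alone sketch. It is not the code from this PR or from KnnGraphTester. The class `NeighborListCheck`, the method `checkNeighbors`, and `CorruptGraphException` are hypothetical names, and the neighbor list is a plain `int[]` rather than Lucene's graph API. It assumes each node's neighbors are stored in ascending node order, which is what the "Neighbors out of order" message in the failure output suggests.

```java
/** Hypothetical stand-alone sketch; not the CheckIndex code from this PR. */
final class NeighborListCheck {

  /** Thrown when a neighbor list violates one of the two invariants (hypothetical type). */
  static final class CorruptGraphException extends RuntimeException {
    CorruptGraphException(String message) {
      super(message);
    }
  }

  /**
   * Verifies that the neighbors of {@code node} are strictly ascending.
   * A smaller value than the previous one means "out of order"; an equal
   * value means "duplicate". Node ids are assumed to be non-negative.
   */
  static void checkNeighbors(int node, int[] neighbors) {
    int first = -1; // first neighbor seen, reported for context
    int last = -1;  // previous neighbor seen
    for (int nbr : neighbors) {
      if (first == -1) {
        first = nbr;
      }
      if (nbr < last) {
        throw new CorruptGraphException(
            "Neighbors out of order for node " + node + ": " + nbr + "<" + last + " 1st=" + first);
      }
      if (nbr == last) {
        throw new CorruptGraphException("Duplicate neighbor " + nbr + " for node " + node);
      }
      last = nbr;
    }
  }

  public static void main(String[] args) {
    checkNeighbors(0, new int[] {3, 7, 12}); // passes silently
    try {
      checkNeighbors(1, new int[] {3, 12, 7});
    } catch (CorruptGraphException e) {
      System.out.println(e.getMessage()); // Neighbors out of order for node 1: 7<12 1st=3
    }
  }
}
```

Because the order check runs first, a single pass that compares each neighbor with its predecessor covers both cases: in a sorted list any duplicate must be adjacent, and a non-adjacent duplicate necessarily shows up as an ordering violation.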
   
   Here are some test results from an index containing the cohere-wikipedia-docs-768d dataset:
   ```
       test: open reader.........OK [took 0.010 sec]
       test: check integrity.....OK [took 2.207 sec]
       test: check live docs.....OK [took 0.000 sec]
       test: field infos.........OK [2 fields] [took 0.000 sec]
       test: field norms.........OK [0 fields] [took 0.000 sec]
       test: terms, freq, prox...    test: stored fields.......OK [1500000 total field count; avg 1.0 fields per doc] [took 0.398 sec]
       test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
       test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET; 0 SKIPPING INDEX] [took 0.000 sec]
       test: points..............OK [0 fields, 0 points] [took 0.000 sec]
       test: vectors.............OK [1 fields, 1500000 vectors] [took 0.497 sec]
       test: hnsw graph..........OK [4 levels, 1547684 nodes (over all levels)] [took 0.485 sec]
   
   No problems were detected with this index.
   ```
   
   I manually corrupted some neighbor lists to check the failure cases:
   ```
       test: open reader.........OK [took 0.012 sec]
       test: check integrity.....OK [took 3.633 sec]
       test: check live docs.....OK [took 0.000 sec]
       test: field infos.........OK [2 fields] [took 0.000 sec]
       test: field norms.........OK [0 fields] [took 0.000 sec]
       test: terms, freq, prox...    test: stored fields.......OK [1500000 total field count; avg 1.0 fields per doc] [took 0.393 sec]
       test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
       test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET; 0 SKIPPING INDEX] [took 0.000 sec]
       test: points..............OK [0 fields, 0 points] [took 0.000 sec]
       test: vectors.............OK [1 fields, 1500000 vectors] [took 0.504 sec]
       test: hnsw graph..........ERROR: org.apache.lucene.index.CheckIndex$CheckIndexException: Neighbors out of order for node 31142: 839445<1326517 1st=33190
   org.apache.lucene.index.CheckIndex$CheckIndexException: Neighbors out of order for node 31142: 839445<1326517 1st=33190
        at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.CheckIndex.testHnswGraph(CheckIndex.java:2798)
        at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.CheckIndex.testSegment(CheckIndex.java:1109)
        at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:806)
        at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:576)
        at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.CheckIndex.doCheck(CheckIndex.java:4479)
        at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.CheckIndex.doMain(CheckIndex.java:4316)
        at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.CheckIndex.main(CheckIndex.java:4248)
   ```
   
   I intend to dig into HNSW graph construction and think about more checks that would be useful here, but I wanted to start small with just these neighbor checks.

