benchaplin opened a new pull request, #13984: URL: https://github.com/apache/lucene/pull/13984
### Description

This is a first pass on #13724. I've added the following checks:

- verify neighbor list is in order
- verify no duplicate neighbors

The code was adapted from [KnnGraphTester](https://github.com/mikemccand/luceneutil/blob/3a6235c2038de6c6d5d0575d8f58cf06c2836793/src/main/knn/KnnGraphTester.java#L521) in luceneutil.

Here are some test results from an index containing the cohere-wikipedia-docs-768d dataset:

```
test: open reader.........OK [took 0.010 sec]
test: check integrity.....OK [took 2.207 sec]
test: check live docs.....OK [took 0.000 sec]
test: field infos.........OK [2 fields] [took 0.000 sec]
test: field norms.........OK [0 fields] [took 0.000 sec]
test: terms, freq, prox...
test: stored fields.......OK [1500000 total field count; avg 1.0 fields per doc] [took 0.398 sec]
test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET; 0 SKIPPING INDEX] [took 0.000 sec]
test: points..............OK [0 fields, 0 points] [took 0.000 sec]
test: vectors.............OK [1 fields, 1500000 vectors] [took 0.497 sec]
test: hnsw graph..........OK [4 levels, 1547684 nodes (over all levels)] [took 0.485 sec]

No problems were detected with this index.
```

I manually corrupted some neighbor lists to check the failure cases:

```
test: open reader.........OK [took 0.012 sec]
test: check integrity.....OK [took 3.633 sec]
test: check live docs.....OK [took 0.000 sec]
test: field infos.........OK [2 fields] [took 0.000 sec]
test: field norms.........OK [0 fields] [took 0.000 sec]
test: terms, freq, prox...
test: stored fields.......OK [1500000 total field count; avg 1.0 fields per doc] [took 0.393 sec]
test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET; 0 SKIPPING INDEX] [took 0.000 sec]
test: points..............OK [0 fields, 0 points] [took 0.000 sec]
test: vectors.............OK [1 fields, 1500000 vectors] [took 0.504 sec]
test: hnsw graph..........ERROR: org.apache.lucene.index.CheckIndex$CheckIndexException: Neighbors out of order for node 31142: 839445<1326517 1st=33190
org.apache.lucene.index.CheckIndex$CheckIndexException: Neighbors out of order for node 31142: 839445<1326517 1st=33190
	at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.CheckIndex.testHnswGraph(CheckIndex.java:2798)
	at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.CheckIndex.testSegment(CheckIndex.java:1109)
	at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:806)
	at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:576)
	at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.CheckIndex.doCheck(CheckIndex.java:4479)
	at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.CheckIndex.doMain(CheckIndex.java:4316)
	at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.CheckIndex.main(CheckIndex.java:4248)
```

I intend to dig into HNSW graph creation and think about more checks that would be useful here, but I wanted to start small with only these neighbor checks.
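For readers who want the gist of the two neighbor checks without opening the diff, here is a minimal standalone sketch. It is not the code in this PR: `NeighborListCheck` and `checkNeighbors` are hypothetical names, it works on a plain `int[]` rather than on Lucene's graph reader, and it assumes (as the error message above suggests) that neighbor ids are non-negative and expected in ascending order.

```java
/** Illustrative only: a hypothetical helper, not part of Lucene or of this PR. */
final class NeighborListCheck {

  /**
   * Checks one node's neighbor list.
   * Returns null when the list is strictly ascending, otherwise a description
   * of the first problem found (out-of-order or duplicate neighbor).
   */
  static String checkNeighbors(int node, int[] neighbors) {
    int last = -1; // assumption: node ids are non-negative, so -1 sorts before all of them
    for (int nbr : neighbors) {
      if (nbr < last) {
        return "Neighbors out of order for node " + node + ": " + nbr + "<" + last;
      }
      if (nbr == last) {
        return "Duplicate neighbor " + nbr + " for node " + node;
      }
      last = nbr;
    }
    return null;
  }

  public static void main(String[] args) {
    // A well-formed list, an out-of-order list, and a list with a repeated id.
    System.out.println(checkNeighbors(0, new int[] {3, 7, 12}));
    System.out.println(checkNeighbors(0, new int[] {3, 12, 7}));
    System.out.println(checkNeighbors(0, new int[] {3, 7, 7}));
  }
}
```

Because the list is expected to be sorted, a single pass that remembers the previous id covers both checks: a smaller id means the order is broken, and an equal id means a duplicate.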