benwtrent commented on issue #14342: URL: https://github.com/apache/lucene/issues/14342#issuecomment-2734567137
OK, a colleague and I spent some time digging into this, and Option 0 (a bug) turned out to be the case. It's a 5 character change (like all good bugs), but here are the new recall numbers for fashion-mnist. Still not mid 90s at 5x oversampling, but WAY better than the abysmal results from before.

```
FLAT Results:
recall  latency(ms)   nDoc  topK  fanout  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  overSample  vec_disk(MB)  vec_RAM(MB)  indexType
 0.444        4.110  60000    10      50     1 bits      0.00      Infinity             1          186.43       1.000       181.646        2.203       FLAT
 0.629        4.383  60000    10      50     1 bits      0.00      Infinity             1          186.43       2.000       181.646        2.203       FLAT
 0.730        4.437  60000    10      50     1 bits      0.00      Infinity             1          186.43       3.000       181.646        2.203       FLAT
 0.792        4.455  60000    10      50     1 bits      0.00      Infinity             1          186.43       4.000       181.646        2.203       FLAT
 0.833        4.445  60000    10      50     1 bits      0.00      Infinity             1          186.43       5.000       181.646        2.203       FLAT
 0.926        4.607  60000    10      50     1 bits      0.00      Infinity             1          186.43      10.000       181.646        2.203       FLAT
```

```
HNSW
recall  latency(ms)   nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  overSample  vec_disk(MB)  vec_RAM(MB)  indexType
 0.443        0.188  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55       1.000       181.646        2.203       HNSW
 0.629        0.274  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55       2.000       181.646        2.203       HNSW
 0.730        0.349  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55       3.000       181.646        2.203       HNSW
 0.792        0.471  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55       4.000       181.646        2.203       HNSW
 0.833        0.479  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55       5.000       181.646        2.203       HNSW
 0.926        0.786  60000    10      50       64        250     1 bits      0.00      Infinity             1          189.55      10.000       181.646        2.203       HNSW
```

For the curious, it had to do with shifting the normal distribution initialization parameters correctly given the standard deviation of the actual vector distribution. We had the mean & std flipped. When the vectors are well behaved, this sort of bug has a tiny effect (which is why we never caught it), but mnist isn't well behaved and brought this nasty little bug to light.
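To make the "mean & std flipped" description concrete, here is a minimal sketch of that kind of mix-up. It is not the Lucene code and not the actual fix: the function name, the `swapped` flag, and the `GRID_LO`/`GRID_HI` constants are all hypothetical. It assumes only that an initial quantization interval is built by taking an interval tuned for a standard normal, scaling it by the data's standard deviation, and shifting it by the data's mean.

```python
import statistics

# Hypothetical interval end points for a standard normal N(0, 1).
# Illustrative values only; they are not taken from Lucene.
GRID_LO, GRID_HI = -0.5, 0.5


def initial_interval(values, swapped=False):
    """Scale a unit-normal interval to the distribution of `values`.

    Intended form: shift by the mean, scale by the standard deviation.
    With swapped=True the two statistics trade places, which is one way
    a "mean and std flipped" bug could look (an assumption, not the
    confirmed Lucene bug).
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if swapped:
        mean, std = std, mean
    return (mean + GRID_LO * std, mean + GRID_HI * std)


if __name__ == "__main__":
    components = [1.0, 3.0]  # mean 2.0, population std 1.0
    print("intended:", initial_interval(components))
    print("swapped: ", initial_interval(components, swapped=True))
```

For the toy input `[1.0, 3.0]` the intended form gives the interval (1.5, 2.5), while the swapped form gives (0.0, 2.0). The sketch only shows that the starting interval depends on which statistic shifts and which one scales; it does not model how much recall changes on a real dataset.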
I am gonna run some more benchmarks and will open a PR soon with the fix. As an aside, there are likely even more gains to be had for non-normally distributed vectors like mnist, but they will take more time and effort.