xiangfu0 commented on issue #10919: URL: https://github.com/apache/pinot/issues/10919#issuecomment-1593813274
Here are some thoughts from my side.

High-level principles:
- CPU-based solution
- KNN search has to be a distributed solution
- The minimal search space is a single segment (10-100MM rows/points)
- Pluggable index structure along with the search algorithm

Since the doc count in one segment is usually < 10MM, I think any of the current **billion scale** approaches is sufficient for us.

In terms of implementation, take SPTAG (https://github.com/microsoft/SPTAG) as an example; the paper is: https://arxiv.org/pdf/2111.08566.pdf

During the index build phase, we need to build an SPTAG index per segment, using hierarchical balanced clustering to generate a set of regions (centroids). We can configure the two parameters below:
- Number of regions, expressed as the percentage of total points that are centroids. According to the paper, 16% is best for search performance and memory usage.
- Replicas, i.e. the number of closest clusters a vector is assigned to. A larger number means better recall, but search requires more resources and has longer latency. According to the paper, 8 is the best balance between perf and latency. We need to use the RNG algorithm to avoid highly similar posting lists for close regions.

During the query phase, the kNN search functionality should be configurable with:
- k (required): how many results to fetch
- t (optional): a percentage used to include more regions in the search, based on the distance to the closest centroids; this increases the recall rate while still keeping resource usage low

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
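A minimal sketch of the query phase described in the comment above, for illustration only. It is not SPTAG's or Pinot's API: the function `knn_search`, its arguments, and the toy data are all hypothetical. It assumes `t` is a fraction and that a region is scanned when its centroid distance is within `(1 + t)` times the distance to the closest centroid, which is one possible reading of "include more regions based on the distance to the closest centroids".

```python
import heapq
import math


def knn_search(query, centroids, postings, vectors, k, t=0.0):
    """Return up to k (distance, vector_id) pairs nearest to query.

    Hypothetical layout for one segment:
      centroids: list of centroid vectors, one per region
      postings:  postings[i] lists the vector ids assigned to region i
      vectors:   dict mapping vector id -> vector
      t:         assumed to be a fraction; regions whose centroid is within
                 (1 + t) * (distance to the closest centroid) are scanned
    """
    centroid_dists = [math.dist(query, c) for c in centroids]
    cutoff = (1.0 + t) * min(centroid_dists)

    # A vector may be replicated into several regions, so collect ids in a set.
    candidates = set()
    for region, dist in enumerate(centroid_dists):
        if dist <= cutoff:
            candidates.update(postings[region])

    scored = ((math.dist(query, vectors[v]), v) for v in candidates)
    return heapq.nsmallest(k, scored)


# Toy data: two regions; vector 3 is replicated into both posting lists.
centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = {0: (0.0, 1.0), 1: (1.0, 0.0), 2: (9.0, 0.0), 3: (4.0, 0.0), 4: (5.2, 0.0)}
postings = [[0, 1, 3], [2, 3, 4]]
query = (4.9, 0.0)

strict = knn_search(query, centroids, postings, vectors, k=2)
relaxed = knn_search(query, centroids, postings, vectors, k=2, t=0.1)
```

With `t=0` only the closest region is scanned, so vector 4 (the true nearest neighbour, stored only in the second region) is missed; with `t=0.1` the second region falls inside the cutoff and vector 4 is returned first. This is the recall versus resource trade-off that `t` is meant to control.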
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org