mayya-sharipova edited a comment on pull request #536: URL: https://github.com/apache/lucene/pull/536#issuecomment-1003460979
> I think it would be prudent to check the size increase/decrease from this change for some dataset/parameter choices

I've checked the index sizes, and the size actually increased by 4-5%:

**glove-100-angular**
Before the change: 517 MB; after the change: 542 MB

**sift-128-euclidean**
Before the change: 542 MB; after the change: 564 MB

With the proposed design, even though we save space by not storing offsets, we encode each node's neighbours as *Int* instead of the current *VInt*, which causes more disk usage.

On the upside:
- we save heap memory because we don't need to load offsets. For an index with 1M docs this saves approximately 1'000'000 nodes * 8 bytes per node = 8 MB (that doesn't look like much, but with many indices, many vector fields, and more docs it grows proportionally; for a field with 100M docs it will be 800 MB)
- there is no noticeable performance degradation from the extra step to calculate offsets:

**glove-100-angular**

| | baseline recall | baseline QPS | candidate recall | candidate QPS |
| ----------- | --------------: | -----------: | ---------------: | ------------: |
| n_cands=10 | 0.496 | 3549.216 | 0.481 | 4027.582 |
| n_cands=20 | 0.560 | 3423.073 | 0.553 | 3369.245 |
| n_cands=40 | 0.635 | 2686.622 | 0.631 | 2633.146 |
| n_cands=80 | 0.708 | 1889.805 | 0.707 | 1890.202 |
| n_cands=120 | 0.747 | 1476.286 | 0.748 | 1451.970 |
| n_cands=200 | 0.790 | 1037.742 | 0.791 | 1013.580 |
| n_cands=400 | 0.840 | 607.183 | 0.841 | 572.152 |
| n_cands=600 | 0.865 | 433.513 | 0.865 | 402.504 |
| n_cands=800 | 0.880 | 341.052 | 0.881 | 320.057 |

**sift-128-euclidean**

| | baseline recall | baseline QPS | candidate recall | candidate QPS |
| ----------- | --------------: | -----------: | ---------------: | ------------: |
| n_cands=10 | 0.747 | 3891.531 | 0.745 | 3926.015 |
| n_cands=20 | 0.817 | 3359.364 | 0.817 | 3365.934 |
| n_cands=40 | 0.889 | 2590.605 | 0.889 | 2568.544 |
| n_cands=80 | 0.944 | 1798.558 | 0.944 | 1806.776 |
| n_cands=120 | 0.964 | 1383.721 | 0.964 | 1425.713 |
| n_cands=200 | 0.983 | 973.862 | 0.983 | 1002.114 |
| n_cands=400 | 0.994 | 586.816 | 0.994 | 599.229 |
| n_cands=600 | 0.997 | 427.128 | 0.997 | 437.296 |
| n_cands=800 | 0.998 | 341.178 | 0.998 | 349.690 |

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
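The trade-off described in the comment (no per-node offsets on heap, but wider neighbour entries on disk) can be illustrated with a small sketch. It is not taken from the pull request: the function names and the fixed-width node layout (one int for the neighbour count plus `max_conn` int slots) are assumptions made for illustration only. The 8 bytes per offset and the 1M / 100M node counts come from the comment; the variable-length size follows the usual 7-bits-per-byte scheme used by encodings such as VInt.

```python
BYTES_PER_OFFSET = 8  # one 8-byte offset per node, as in the comment's estimate
BYTES_PER_INT = 4     # fixed-width Int


def offsets_heap_bytes(num_nodes: int) -> int:
    """Heap needed to keep one 8-byte offset per node in memory."""
    return num_nodes * BYTES_PER_OFFSET


def vint_size(value: int) -> int:
    """Bytes used by a 7-bits-per-byte variable-length encoding of a non-negative int."""
    if value < 0:
        raise ValueError("value must be non-negative")
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def fixed_width_offset(node: int, max_conn: int) -> int:
    """Byte offset of a node's neighbour list, computed instead of stored.

    Hypothetical layout (an assumption, not confirmed by the PR): every node
    occupies one int for the neighbour count plus max_conn int slots.
    """
    return node * (1 + max_conn) * BYTES_PER_INT


if __name__ == "__main__":
    print(offsets_heap_bytes(1_000_000))    # heap saved for 1M nodes, in bytes
    print(offsets_heap_bytes(100_000_000))  # heap saved for 100M nodes, in bytes
    print(vint_size(999_999), BYTES_PER_INT)  # variable-length vs fixed-width bytes
    print(fixed_width_offset(3, 16))
```

Under these assumptions a node id below 2^21 takes at most 3 bytes in the variable-length form but always 4 bytes as a fixed-width Int, which is consistent with the direction of the measured size increase, while the fixed record size is what lets an offset be derived by multiplication rather than looked up.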