benwtrent commented on issue #12342: URL: https://github.com/apache/lucene/issues/12342#issuecomment-1649997886
OK, I reran my experiments. I ran two: one with `reverse` non-transformed (so the dimension within knnPerf is 768) and one with `reverse` transformed (769 dimensions).

### Reverse, not transformed (768 dims)

```
recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms  selectivity  filterType
0.145   0.38     400000  0       48       200        10       0         1.00         post-filter
0.553   2.05     400000  90      48       200        100      0         1.00         post-filter
0.709   3.66     400000  190     48       200        200      0         1.00         post-filter
0.878   8.05     400000  490     48       200        500      0         1.00         post-filter
```

### Reverse, transformed (769 dims)

```
recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms  selectivity  filterType
0.211   0.49     400000  0       48       200        10       0         1.00         post-filter
0.691   2.80     400000  90      48       200        100      0         1.00         post-filter
0.814   5.14     400000  190     48       200        200      0         1.00         post-filter
0.926   11.31    400000  490     48       200        500      0         1.00         post-filter
```

Recall seems improved for me, while latency increases with the transformed data. I bet part of this is also the overhead of dealing with CPU execution lanes in Panama, since 769 is no longer a "nice" number of dimensions.

So, my `transformed` numbers match @jmazanec15's results exactly. However, I am seeing an extreme discrepancy on my non-transformed run. @jmazanec15, here is the code I used to generate my "reverse" non-transformed data. Could you double check that your `descending` case data does the same? There is something significant here that we are missing. A quick sketch of the ordering check and the 769-dim transform follows the script below.

```python
import numpy as np
import pyarrow.parquet as pq

tb1 = pq.read_table("train-00000-of-00004-1a1932c9ca1c7152.parquet", columns=['emb'])
tb2 = pq.read_table("train-00001-of-00004-f4a4f5540ade14b4.parquet", columns=['emb'])
tb3 = pq.read_table("train-00002-of-00004-ff770df3ab420d14.parquet", columns=['emb'])
tb4 = pq.read_table("train-00003-of-00004-85b3dbbc960e92ec.parquet", columns=['emb'])

np1 = tb1[0].to_numpy()
np2 = tb2[0].to_numpy()
np3 = tb3[0].to_numpy()
np4 = tb4[0].to_numpy()

np_total = np.concatenate((np1, np2, np3, np4))

# Have to convert to a list here to get
# the numpy ndarray's shape correct later
# There's probably a better way...
flat_ds = list()
for vec in np_total:
    flat_ds.append(vec)

# Shape is (485859, 768) and dtype is float32
np_flat_ds = np.array(flat_ds)

# Sort the first 400k vectors by ascending magnitude
magnitudes = np.linalg.norm(np_flat_ds[0:400000], axis=1)
indices = np.argsort(magnitudes)
np_flat_ds_sorted = np_flat_ds[indices]

# Note: np.flip with no axis argument reverses along every axis of the array
with open("wiki768.reversed.train", "w") as out_f:
    np.flip(np_flat_ds_sorted).tofile(out_f)
```
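For reference, here is a minimal sketch of the two pieces referenced above: checking the norm ordering of the written file and producing a 769-dim `transformed` variant. This is only a sketch, not the exact script behind the numbers above: it assumes the `.train` file is raw float32 (as the `tofile` call implies) and that the transform is the usual max-inner-product augmentation that appends `sqrt(maxNorm^2 - ||v||^2)` as an extra component; the `DIM` constant and the `wiki769.reversed.train` output name are placeholders.

```python
import numpy as np

DIM = 768  # placeholder; matches the 768-dim embeddings above

# Read back the raw float32 vectors written by the script above
vecs = np.fromfile("wiki768.reversed.train", dtype=np.float32).reshape(-1, DIM)

# Sanity check: norms should be non-increasing if the "reverse"
# (descending-magnitude) ordering came out as intended
mags = np.linalg.norm(vecs, axis=1)
print("descending by norm:", bool(np.all(np.diff(mags) <= 0)))

# 768 -> 769 augmentation: append sqrt(maxNorm^2 - ||v||^2) so every
# transformed vector ends up with the same norm; the first 768
# components are left untouched
max_norm = mags.max()
extra = np.sqrt(np.maximum(max_norm ** 2 - mags ** 2, 0.0)).astype(np.float32)
transformed = np.hstack([vecs, extra[:, None]]).astype(np.float32)

with open("wiki769.reversed.train", "wb") as out_f:  # placeholder name
    transformed.tofile(out_f)
```

Note that the ordering check is independent of the component order inside each vector (reversing components does not change a vector's norm), so it only tells you whether the docs are written largest-norm first.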