jmazanec15 commented on issue #12342: URL: https://github.com/apache/lucene/issues/12342#issuecomment-1649102549
> 🤦 yep!
> Here is with the higher max conn. Sort of better.

Right, I was thinking this might explain the recall discrepancy for the dot-product score change (0.989 vs 0.991).

I ran the tests on the non-transformed data and the numbers look pretty similar across the board:

```
### Random (default order)
recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms
0.715    0.79    400000    0     48       200         10      1910428  1.00  post-filter
0.973    3.87    400000   90     48       200        100      1923226  1.00  post-filter
0.990    6.76    400000  190     48       200        200      1927580  1.00  post-filter
0.998   13.78    400000  490     48       200        500      1917602  1.00  post-filter

### Ascend
recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms
0.771    0.89    400000    0     48       200         10      2093236  1.00  post-filter
0.983    4.45    400000   90     48       200        100      2095450  1.00  post-filter
0.993    7.88    400000  190     48       200        200      2094090  1.00  post-filter
0.998   16.08    400000  490     48       200        500      2112938  1.00  post-filter

### Descend
recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms
0.710    0.79    400000    0     48       200         10      1915806  1.00  post-filter
0.973    3.73    400000   90     48       200        100      1910817  1.00  post-filter
0.991    6.55    400000  190     48       200        200      1898517  1.00  post-filter
0.998   13.25    400000  490     48       200        500      1912997  1.00  post-filter
```

@benwtrent For your results, I see that visited was 0, which might mean there is some kind of bug.

I transformed the data (thanks @searchivarius for the help), and I got results that had overall lower recall but were a little bit faster:

```
### Random (default order)
recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms
0.359    0.36    400000    0     48       200         10      1464332  1.00  post-filter
0.728    1.39    400000   90     48       200        100      1457250  1.00  post-filter
0.801    2.43    400000  190     48       200        200      1471881  1.00  post-filter
0.874    5.28    400000  490     48       200        500      1458984  1.00  post-filter

### Ascend
recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms
0.289    0.31    400000    0     48       200         10      1315149  1.00  post-filter
0.705    1.17    400000   90     48       200        100      1312877  1.00  post-filter
0.794    2.00    400000  190     48       200        200      1316609  1.00  post-filter
0.877    4.32    400000  490     48       200        500      1303967  1.00  post-filter

### Descend
recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms
0.211    1.20    400000    0     48       200         10      2321339  1.00  post-filter
0.691    6.57    400000   90     48       200        100      2312672  1.00  post-filter
0.814   11.75    400000  190     48       200        200      2313213  1.00  post-filter
0.926   26.31    400000  490     48       200        500      2307567  1.00  post-filter
```

Based on these results and the papers @searchivarius shared, I think it's probably okay not to add this transform for now.
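For context on why the transform preserves the dot-product ordering: appending `sqrt(max_norm^2 - ||d||^2)` to each document and a `0` to each query leaves every dot product unchanged while giving all augmented documents the same norm, so max-inner-product search on the original data becomes angular/cosine search on the augmented data. Here is a minimal sanity-check sketch of that identity on random data (variable names are mine, not from the script below):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64)).astype(np.float32)
queries = rng.normal(size=(10, 64)).astype(np.float32)

norms = np.linalg.norm(docs, axis=1)
max_norm = norms.max()

# Augment docs with sqrt(max_norm^2 - ||d||^2) and queries with 0.
docs_aug = np.concatenate([docs, np.sqrt(max_norm**2 - norms**2)[:, None]], axis=1)
queries_aug = np.concatenate(
    [queries, np.zeros((len(queries), 1), dtype=np.float32)], axis=1
)

# Dot products are preserved (up to float error)...
assert np.allclose(queries_aug @ docs_aug.T, queries @ docs.T, atol=1e-3)
# ...and all augmented docs share the same norm, so max-inner-product
# order on the original data equals cosine order on the augmented data.
assert np.allclose(np.linalg.norm(docs_aug, axis=1), max_norm, atol=1e-3)
```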
<details>
<summary><h2>Here is the script I used for transforming the data set</h2></summary>

```python
import numpy as np
import pyarrow.parquet as pq


def transform_queries(Q):
    n, _ = Q.shape
    # Append a zero coordinate so the augmented dot product is unchanged.
    return np.concatenate([Q, np.zeros((n, 1))], axis=-1, dtype=np.float32)


def transform_docs(D, norms):
    n, d = D.shape
    max_norm = norms.max()
    flipped_norms = np.copy(norms).reshape(n, 1)
    # Append sqrt(max_norm^2 - ||d||^2) so every augmented doc has norm max_norm.
    transformed_data = np.concatenate(
        [D, np.sqrt(max_norm**2 - flipped_norms**2)], axis=-1, dtype=np.float32
    )
    return transformed_data


def validate_array_match_upto_dim(arr1, arr2, dim_eq_upto):
    assert np.allclose(arr1[:dim_eq_upto], arr2[:dim_eq_upto]), "data sets are different"


def validate_dataset_match_upto_dim(arr1, arr2, dim_eq_upto):
    n1, d1 = arr1.shape
    n2, d2 = arr2.shape
    assert n1 == n2, "Shape does not map"
    for i in range(n1):
        validate_array_match_upto_dim(arr1[i], arr2[i], dim_eq_upto)


tb1 = pq.read_table("train-00000-of-00004-1a1932c9ca1c7152.parquet", columns=['emb'])
tb2 = pq.read_table("train-00001-of-00004-f4a4f5540ade14b4.parquet", columns=['emb'])
tb3 = pq.read_table("train-00002-of-00004-ff770df3ab420d14.parquet", columns=['emb'])
tb4 = pq.read_table("train-00003-of-00004-85b3dbbc960e92ec.parquet", columns=['emb'])

np1 = tb1[0].to_numpy()
np2 = tb2[0].to_numpy()
np3 = tb3[0].to_numpy()
np4 = tb4[0].to_numpy()

np_total = np.concatenate((np1, np2, np3, np4))

# Have to convert to a list here to get
# the numpy ndarray's shape correct later
# There's probably a better way...
flat_ds = list()
for vec in np_total:
    flat_ds.append(vec)

# Shape is (485859, 768) and dtype is float32
np_flat_ds = np.array(flat_ds)

transformed_queries = transform_queries(np_flat_ds[475858:-1])
validate_dataset_match_upto_dim(transformed_queries, np_flat_ds[475858:-1], 768)
with open("wiki768.test", "wb") as out_f:
    transformed_queries.tofile(out_f)

magnitudes = np.linalg.norm(np_flat_ds[0:400000], axis=1)
indices = np.argsort(magnitudes)
transformed_np_flat_ds = transform_docs(np_flat_ds[0:400000], magnitudes)
validate_dataset_match_upto_dim(transformed_np_flat_ds, np_flat_ds[0:400000], 768)
transformed_np_flat_ds_sorted = transformed_np_flat_ds[indices]

with open("wiki768.random.train", "wb") as out_f:
    transformed_np_flat_ds.tofile(out_f)
with open("wiki768.ordered.train", "wb") as out_f:
    transformed_np_flat_ds_sorted.tofile(out_f)
with open("wiki768.reversed.train", "wb") as out_f:
    # Flip along the document axis only; a bare np.flip would also reverse
    # the coordinates within each vector.
    np.flip(transformed_np_flat_ds_sorted, axis=0).tofile(out_f)
```

</details>
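In case anyone wants to load the generated files back: the script writes raw float32 vectors with `tofile` (no header), so something like the sketch below should round-trip them. The 769 here is an assumption on my part (768 original dims plus the one augmented dim):

```python
import numpy as np

DIM = 769  # assumed: 768 original dims + 1 augmented dim

docs = np.fromfile("wiki768.random.train", dtype=np.float32).reshape(-1, DIM)
queries = np.fromfile("wiki768.test", dtype=np.float32).reshape(-1, DIM)
print(docs.shape, queries.shape)
```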