benwtrent commented on issue #12342: URL: https://github.com/apache/lucene/issues/12342#issuecomment-1644597999
I updated the data-gathering script to handle the adversarial cases where the vectors are sorted by magnitude, in ascending and in reverse order. I have run the in-order version so far and am testing the rest now.

ORDERED

```
WARNING: Gnuplot module not present; will not make charts
recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms
0.741   0.33     400000  0       32       200        10       0         1.00  post-filter
0.979   1.67     400000  90      32       200        100      0         1.00  post-filter
0.992   2.89     400000  190     32       200        200      0         1.00  post-filter
```

<details>
<summary> <h2>Updated script</h2></summary>

```python
import numpy as np
import pyarrow.parquet as pq

# Load only the embedding column from each parquet shard.
tb1 = pq.read_table("train-00000-of-00004-1a1932c9ca1c7152.parquet", columns=['emb'])
tb2 = pq.read_table("train-00001-of-00004-f4a4f5540ade14b4.parquet", columns=['emb'])
tb3 = pq.read_table("train-00002-of-00004-ff770df3ab420d14.parquet", columns=['emb'])
tb4 = pq.read_table("train-00003-of-00004-85b3dbbc960e92ec.parquet", columns=['emb'])
np1 = tb1[0].to_numpy()
np2 = tb2[0].to_numpy()
np3 = tb3[0].to_numpy()
np4 = tb4[0].to_numpy()

np_total = np.concatenate((np1, np2, np3, np4))

# Have to convert to a list here to get
# the numpy ndarray's shape correct later
# There's probably a better way...
flat_ds = list()
for vec in np_total:
    flat_ds.append(vec)
# Shape is (485859, 768) and dtype is float32
np_flat_ds = np.array(flat_ds)

# Query vectors: the tail of the data set, excluding the final vector.
# Files are opened in binary mode because tofile() writes raw bytes.
with open("wiki768.test", "wb") as out_f:
    np_flat_ds[475858:-1].tofile(out_f)

# Sort the first 400000 vectors by ascending L2 magnitude.
magnitudes = np.linalg.norm(np_flat_ds[0:400000], axis=1)
indices = np.argsort(magnitudes)
np_flat_ds_sorted = np_flat_ds[indices]
with open("wiki768.ordered.train", "wb") as out_f:
    np_flat_ds_sorted.tofile(out_f)
with open("wiki768.reversed.train", "wb") as out_f:
    # axis=0 reverses the row order only; np.flip without an axis
    # would also reverse the components inside every vector.
    np.flip(np_flat_ds_sorted, axis=0).tofile(out_f)
with open("wiki768.random.train", "wb") as out_f:
    # Shuffles the rows in place, so this must stay the last step.
    np.random.shuffle(np_flat_ds_sorted)
    np_flat_ds_sorted.tofile(out_f)
```

</details>

-- This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org