benwtrent commented on issue #12342:
URL: https://github.com/apache/lucene/issues/12342#issuecomment-1644597999
I updated the data-gathering script to cover the adversarial cases where the
vectors are indexed in ascending and in descending order of magnitude.
I have run the in-order version so far and am testing the rest now.
ORDERED
```
WARNING: Gnuplot module not present; will not make charts
recall latency nDoc fanout maxConn beamWidth visited index ms
0.741 0.33 400000 0 32 200 10 0 1.00 post-filter
0.979 1.67 400000 90 32 200 100 0 1.00 post-filter
0.992 2.89 400000 190 32 200 200 0 1.00 post-filter
```
<details>
<summary> <h2>Updated script</h2></summary>
```python
import numpy as np
import pyarrow.parquet as pq
tb1 = pq.read_table("train-00000-of-00004-1a1932c9ca1c7152.parquet",
columns=['emb'])
tb2 = pq.read_table("train-00001-of-00004-f4a4f5540ade14b4.parquet",
columns=['emb'])
tb3 = pq.read_table("train-00002-of-00004-ff770df3ab420d14.parquet",
columns=['emb'])
tb4 = pq.read_table("train-00003-of-00004-85b3dbbc960e92ec.parquet",
columns=['emb'])
np1 = tb1[0].to_numpy()
np2 = tb2[0].to_numpy()
np4 = tb4[0].to_numpy()
np3 = tb3[0].to_numpy()
np_total = np.concatenate((np1, np2, np3, np4))
# Have to convert to a list here to get
# the numpy ndarray's shape correct later
# There's probably a better way...
flat_ds = list()
for vec in np_total:
    flat_ds.append(vec)
np_flat_ds = np.array(flat_ds)
# Shape is (485859, 768) and dtype is float32
np_flat_ds
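# Hold out 10,000 vectors as the query set (the -1 excludes the final vector)
with open("wiki768.test", "wb") as out_f:
    np_flat_ds[475858:-1].tofile(out_f)
# Sort the first 400,000 vectors by ascending L2 magnitude
magnitudes = np.linalg.norm(np_flat_ds[0:400000], axis=1)
indices = np.argsort(magnitudes)
np_flat_ds_sorted = np_flat_ds[indices]
with open("wiki768.ordered.train", "wb") as out_f:
    np_flat_ds_sorted.tofile(out_f)
with open("wiki768.reversed.train", "wb") as out_f:
    # axis=0 reverses only the row order; a bare np.flip would also
    # reverse the components inside every vector
    np.flip(np_flat_ds_sorted, axis=0).tofile(out_f)
with open("wiki768.random.train", "wb") as out_f:
    # Shuffles rows in place, so np_flat_ds_sorted is no longer sorted afterwards
    np.random.shuffle(np_flat_ds_sorted)
    np_flat_ds_sorted.tofile(out_f)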
```
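
On the "probably a better way" note in the script: one option that might work is `np.stack`, which builds the 2-D array directly. This is only a sketch, assuming the `emb` column comes back from `to_numpy()` as a 1-D object array of equal-length float32 arrays; `to_matrix` is a hypothetical helper name and the small `rows` array is a synthetic stand-in, not the real data.

```python
import numpy as np


def to_matrix(object_array):
    """Stack a 1-D object array of equal-length vectors into a 2-D array."""
    return np.stack(object_array)


# Synthetic stand-in for a list-typed column after to_numpy():
# three rows, each holding its own 4-element float32 vector.
rows = np.empty(3, dtype=object)
for i in range(3):
    rows[i] = np.full(4, i, dtype=np.float32)

matrix = to_matrix(rows)
print(matrix.shape, matrix.dtype)
```

If that assumption about the column layout holds, `np.stack(np_total)` could replace the explicit loop and `np.array(flat_ds)` call.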
</details>
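
A quick sanity check for the ordered and reversed files could look like the sketch below. It assumes the files are raw, packed float32 values as written by `tofile`, with 768 dimensions per vector as noted in the script; `load_vectors` and `norms_are_monotonic` are hypothetical helper names, not part of the script.

```python
import numpy as np


def load_vectors(path, dim, dtype=np.float32):
    """Read a raw file of packed vectors back as an (n, dim) array."""
    flat = np.fromfile(path, dtype=dtype)
    return flat.reshape(-1, dim)


def norms_are_monotonic(vectors, descending=False):
    """Return True if row magnitudes never decrease (or never increase)."""
    norms = np.linalg.norm(vectors, axis=1)
    diffs = np.diff(norms)
    if descending:
        return bool(np.all(diffs <= 0))
    return bool(np.all(diffs >= 0))
```

For example, `norms_are_monotonic(load_vectors("wiki768.ordered.train", 768))` should return `True`, and passing `descending=True` is the matching check for `wiki768.reversed.train`.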