benwtrent commented on issue #12342:
URL: https://github.com/apache/lucene/issues/12342#issuecomment-1649997886
OK, I reran my experiments. I ran two: one with `reverse` non-transformed
(so the dimension within knnPerf is 768) and one with `reverse` transformed
(769 dimensions).
### Reverse, not transformed (768 dims)
```
recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms  selectivity  filterType
0.145   0.38     400000  0       48       200        10       0         1.00         post-filter
0.553   2.05     400000  90      48       200        100      0         1.00         post-filter
0.709   3.66     400000  190     48       200        200      0         1.00         post-filter
0.878   8.05     400000  490     48       200        500      0         1.00         post-filter
```
### Reverse, transformed (769 dims)
```
recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms  selectivity  filterType
0.211   0.49     400000  0       48       200        10       0         1.00         post-filter
0.691   2.80     400000  90      48       200        100      0         1.00         post-filter
0.814   5.14     400000  190     48       200        200      0         1.00         post-filter
0.926   11.31    400000  490     48       200        500      0         1.00         post-filter
```
Recall is improved for me, while latency increases on the transformed data. I
suspect part of that overhead is the Panama vectorization dealing with CPU
execution lanes: 769 is no longer a "nice" number of dimensions, so each
distance computation ends with a scalar tail.
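The lane arithmetic can be sketched quickly (assuming 16 float32 lanes per 512-bit register; the actual species Panama selects depends on the hardware and JDK):

```python
# Rough illustration of why 769 dims is less SIMD-friendly than 768.
# LANES = 16 assumes 512-bit registers holding 16 float32 values;
# narrower registers change the numbers but not the conclusion.
LANES = 16

for dims in (768, 769):
    full, tail = divmod(dims, LANES)
    print(f"{dims} dims -> {full} full vector ops + {tail} scalar tail element(s)")
```

With 768 dims the loop divides evenly; with 769 every dot product pays for one extra scalar iteration plus the loop-remainder bookkeeping.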
So, my `transformed` numbers match @jmazanec15's results exactly. However, I
am seeing an extreme discrepancy on my non-transformed numbers.
@jmazanec15 here is the code I used to generate my "reverse" non-transformed
data. Could you double-check that your `descending` case data is generated the
same way?
There is something significant here that we are missing.
```python
import numpy as np
import pyarrow.parquet as pq

tb1 = pq.read_table("train-00000-of-00004-1a1932c9ca1c7152.parquet",
                    columns=['emb'])
tb2 = pq.read_table("train-00001-of-00004-f4a4f5540ade14b4.parquet",
                    columns=['emb'])
tb3 = pq.read_table("train-00002-of-00004-ff770df3ab420d14.parquet",
                    columns=['emb'])
tb4 = pq.read_table("train-00003-of-00004-85b3dbbc960e92ec.parquet",
                    columns=['emb'])
np1 = tb1[0].to_numpy()
np2 = tb2[0].to_numpy()
np3 = tb3[0].to_numpy()
np4 = tb4[0].to_numpy()
np_total = np.concatenate((np1, np2, np3, np4))

# Have to convert to a list here to get the numpy ndarray's shape
# correct later (np_total is an object array of per-row arrays).
# There's probably a better way...
flat_ds = list()
for vec in np_total:
    flat_ds.append(vec)

# Shape is (485859, 768) and dtype is float32
np_flat_ds = np.array(flat_ds)

# Sort the first 400k vectors by ascending magnitude, then flip.
magnitudes = np.linalg.norm(np_flat_ds[0:400000], axis=1)
indices = np.argsort(magnitudes)
np_flat_ds_sorted = np_flat_ds[indices]

# tofile writes raw bytes, so the file must be opened in binary mode.
with open("wiki768.reversed.train", "wb") as out_f:
    np.flip(np_flat_ds_sorted).tofile(out_f)
```
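One thing worth confirming while double-checking: `np.flip` with no `axis` argument reverses every axis of a 2-D array, not just the row order. A minimal sketch of the difference:

```python
import numpy as np

# np.flip(a) reverses both the row order AND the components within each
# row; np.flip(a, axis=0) reverses only the row order. Which one the
# descending-order dataset generation intends matters for comparing runs.
a = np.array([[1, 2, 3],
              [4, 5, 6]])
print(np.flip(a))          # [[6 5 4]
                           #  [3 2 1]]
print(np.flip(a, axis=0))  # [[4 5 6]
                           #  [1 2 3]]
```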
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]