jmazanec15 commented on issue #12342:
URL: https://github.com/apache/lucene/issues/12342#issuecomment-1649102549
> 🤦 yep!
> Here is with the higher max conn. Sort of better.
Right, I was thinking this might explain the recall discrepancy for the
dot-product score change (0.989 vs 0.991).
I ran the tests on the non-transformed data and the numbers seem pretty
similar across the board:
```
### Random (default order)
recall  latency    nDoc  fanout  maxConn  beamWidth  visited  index ms  selectivity  filterType
 0.715     0.79  400000       0       48        200       10   1910428         1.00  post-filter
 0.973     3.87  400000      90       48        200      100   1923226         1.00  post-filter
 0.990     6.76  400000     190       48        200      200   1927580         1.00  post-filter
 0.998    13.78  400000     490       48        200      500   1917602         1.00  post-filter
### Ascend
recall  latency    nDoc  fanout  maxConn  beamWidth  visited  index ms  selectivity  filterType
 0.771     0.89  400000       0       48        200       10   2093236         1.00  post-filter
 0.983     4.45  400000      90       48        200      100   2095450         1.00  post-filter
 0.993     7.88  400000     190       48        200      200   2094090         1.00  post-filter
 0.998    16.08  400000     490       48        200      500   2112938         1.00  post-filter
### Descend
recall  latency    nDoc  fanout  maxConn  beamWidth  visited  index ms  selectivity  filterType
 0.710     0.79  400000       0       48        200       10   1915806         1.00  post-filter
 0.973     3.73  400000      90       48        200      100   1910817         1.00  post-filter
 0.991     6.55  400000     190       48        200      200   1898517         1.00  post-filter
 0.998    13.25  400000     490       48        200      500   1912997         1.00  post-filter
```
@benwtrent For your results, I see that `visited` was 0, which might mean
there is some kind of bug.
I transformed the data (thanks @searchivarius for the help), and I got
results with overall lower recall, but somewhat lower latency:
```
### Random (default order)
recall  latency    nDoc  fanout  maxConn  beamWidth  visited  index ms  selectivity  filterType
 0.359     0.36  400000       0       48        200       10   1464332         1.00  post-filter
 0.728     1.39  400000      90       48        200      100   1457250         1.00  post-filter
 0.801     2.43  400000     190       48        200      200   1471881         1.00  post-filter
 0.874     5.28  400000     490       48        200      500   1458984         1.00  post-filter
### Ascend
recall  latency    nDoc  fanout  maxConn  beamWidth  visited  index ms  selectivity  filterType
 0.289     0.31  400000       0       48        200       10   1315149         1.00  post-filter
 0.705     1.17  400000      90       48        200      100   1312877         1.00  post-filter
 0.794     2.00  400000     190       48        200      200   1316609         1.00  post-filter
 0.877     4.32  400000     490       48        200      500   1303967         1.00  post-filter
### Descend
recall  latency    nDoc  fanout  maxConn  beamWidth  visited  index ms  selectivity  filterType
 0.211     1.20  400000       0       48        200       10   2321339         1.00  post-filter
 0.691     6.57  400000      90       48        200      100   2312672         1.00  post-filter
 0.814    11.75  400000     190       48        200      200   2313213         1.00  post-filter
 0.926    26.31  400000     490       48        200      500   2307567         1.00  post-filter
```
Based on these results and the papers @searchivarius shared, I think it's
probably okay not to add this transform for now.
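For context on why the appended dimension preserves ranking: the transform is the standard reduction from maximum inner product search to Euclidean search. Each doc gets `sqrt(max_norm**2 - ||d||**2)` appended so all docs end up with the same norm, and queries get a 0 appended, which leaves the inner products unchanged while making the Euclidean distance a monotone function of the inner product. A minimal numpy sanity check on synthetic data (the names and sizes here are illustrative, not from the benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((100, 8)).astype(np.float32)  # docs
q = rng.standard_normal(8).astype(np.float32)         # query

norms = np.linalg.norm(D, axis=1)
max_norm = norms.max()
# Docs: append sqrt(max_norm^2 - ||d||^2); queries: append 0.
D_t = np.concatenate([D, np.sqrt(max_norm**2 - norms**2)[:, None]], axis=1)
q_t = np.concatenate([q, [0.0]]).astype(np.float32)

# ||d_t - q_t||^2 = max_norm^2 + ||q||^2 - 2<d, q>, a constant minus
# twice the inner product, so the two rankings must agree.
ip_order = np.argsort(-(D @ q))
l2_order = np.argsort(np.linalg.norm(D_t - q_t, axis=1))
assert (ip_order == l2_order).all()
```

Since every transformed doc shares the same norm, cosine ordering coincides with the Euclidean ordering as well, which is what allows the search to run with a non-inner-product metric.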
<details>
<summary><h2>Here is the script I used for transforming the data set</h2></summary>

```python
import numpy as np
import pyarrow.parquet as pq


def transform_queries(Q):
    n, _ = Q.shape
    # Queries get a zero appended; their inner products are unchanged.
    return np.concatenate([Q, np.zeros((n, 1))], axis=-1, dtype=np.float32)


def transform_docs(D, norms):
    n, d = D.shape
    max_norm = norms.max()
    flipped_norms = np.copy(norms).reshape(n, 1)
    # Docs get sqrt(max_norm^2 - ||d||^2) appended so every doc ends up
    # with the same norm.
    transformed_data = np.concatenate(
        [D, np.sqrt(max_norm**2 - flipped_norms**2)], axis=-1, dtype=np.float32
    )
    return transformed_data


def validate_array_match_upto_dim(arr1, arr2, dim_eq_upto):
    assert np.allclose(arr1[:dim_eq_upto], arr2[:dim_eq_upto]), "data sets are different"


def validate_dataset_match_upto_dim(arr1, arr2, dim_eq_upto):
    n1, d1 = arr1.shape
    n2, d2 = arr2.shape
    assert n1 == n2, "Shape does not map"
    for i in range(n1):
        validate_array_match_upto_dim(arr1[i], arr2[i], dim_eq_upto)


tb1 = pq.read_table("train-00000-of-00004-1a1932c9ca1c7152.parquet", columns=['emb'])
tb2 = pq.read_table("train-00001-of-00004-f4a4f5540ade14b4.parquet", columns=['emb'])
tb3 = pq.read_table("train-00002-of-00004-ff770df3ab420d14.parquet", columns=['emb'])
tb4 = pq.read_table("train-00003-of-00004-85b3dbbc960e92ec.parquet", columns=['emb'])
np1 = tb1[0].to_numpy()
np2 = tb2[0].to_numpy()
np3 = tb3[0].to_numpy()
np4 = tb4[0].to_numpy()
np_total = np.concatenate((np1, np2, np3, np4))

# Have to convert to a list here to get
# the numpy ndarray's shape correct later
# There's probably a better way...
flat_ds = list()
for vec in np_total:
    flat_ds.append(vec)

# Shape is (485859, 768) and dtype is float32
np_flat_ds = np.array(flat_ds)

transformed_queries = transform_queries(np_flat_ds[475858:-1])
validate_dataset_match_upto_dim(transformed_queries, np_flat_ds[475858:-1], 768)
with open("wiki768.test", "wb") as out_f:
    transformed_queries.tofile(out_f)

magnitudes = np.linalg.norm(np_flat_ds[0:400000], axis=1)
indices = np.argsort(magnitudes)
transformed_np_flat_ds = transform_docs(np_flat_ds[0:400000], magnitudes)
validate_dataset_match_upto_dim(transformed_np_flat_ds, np_flat_ds[0:400000], 768)
transformed_np_flat_ds_sorted = transformed_np_flat_ds[indices]

with open("wiki768.random.train", "wb") as out_f:
    transformed_np_flat_ds.tofile(out_f)
with open("wiki768.ordered.train", "wb") as out_f:
    transformed_np_flat_ds_sorted.tofile(out_f)
with open("wiki768.reversed.train", "wb") as out_f:
    # Flip along axis 0 only, to reverse the document order without
    # also reversing each vector's components.
    np.flip(transformed_np_flat_ds_sorted, axis=0).tofile(out_f)
```
</details>
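A note on reading the files the script writes: `tofile` emits raw float32 values with no header or shape information, so a reader has to supply the dimensionality (769 after the transform appends one coordinate to the 768-d vectors). A small round-trip sketch; the file name and sizes here are illustrative:

```python
import numpy as np

# Write a tiny synthetic batch the same way the script does: raw
# float32, no header, shape known only to the reader.
demo = np.arange(2 * 769, dtype=np.float32).reshape(2, 769)
with open("demo.train", "wb") as out_f:
    demo.tofile(out_f)

# Read it back by reshaping on the known dimensionality.
vecs = np.fromfile("demo.train", dtype=np.float32).reshape(-1, 769)
assert (vecs == demo).all()
```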