jmazanec15 commented on issue #12342: URL: https://github.com/apache/lucene/issues/12342#issuecomment-1649102549
> 🤦 yep!
> Here is with the higher max conn. Sort of better.

Right, I was thinking this might explain the recall discrepancy for the dot-product score change (0.989 vs 0.991).

I ran the tests on the non-transformed data and the numbers look pretty similar across the board:

```
### Random (default order)
recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms
0.715    0.79    400000    0     48       200         10      1910428  1.00  post-filter
0.973    3.87    400000   90     48       200        100      1923226  1.00  post-filter
0.990    6.76    400000  190     48       200        200      1927580  1.00  post-filter
0.998   13.78    400000  490     48       200        500      1917602  1.00  post-filter

### Ascend
recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms
0.771    0.89    400000    0     48       200         10      2093236  1.00  post-filter
0.983    4.45    400000   90     48       200        100      2095450  1.00  post-filter
0.993    7.88    400000  190     48       200        200      2094090  1.00  post-filter
0.998   16.08    400000  490     48       200        500      2112938  1.00  post-filter

### Descend
recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms
0.710    0.79    400000    0     48       200         10      1915806  1.00  post-filter
0.973    3.73    400000   90     48       200        100      1910817  1.00  post-filter
0.991    6.55    400000  190     48       200        200      1898517  1.00  post-filter
0.998   13.25    400000  490     48       200        500      1912997  1.00  post-filter
```

@benwtrent For your results, I see that visited was 0, which might mean there is some kind of bug.

I transformed the data (thanks @searchivarius for the help), and I got results that had overall lower recall but were a little bit faster:

```
### Random (default order)
recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms
0.359    0.36    400000    0     48       200         10      1464332  1.00  post-filter
0.728    1.39    400000   90     48       200        100      1457250  1.00  post-filter
0.801    2.43    400000  190     48       200        200      1471881  1.00  post-filter
0.874    5.28    400000  490     48       200        500      1458984  1.00  post-filter

### Ascend
recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms
0.289    0.31    400000    0     48       200         10      1315149  1.00  post-filter
0.705    1.17    400000   90     48       200        100      1312877  1.00  post-filter
0.794    2.00    400000  190     48       200        200      1316609  1.00  post-filter
0.877    4.32    400000  490     48       200        500      1303967  1.00  post-filter

### Descend
recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms
0.211    1.20    400000    0     48       200         10      2321339  1.00  post-filter
0.691    6.57    400000   90     48       200        100      2312672  1.00  post-filter
0.814   11.75    400000  190     48       200        200      2313213  1.00  post-filter
0.926   26.31    400000  490     48       200        500      2307567  1.00  post-filter
```

Based on these results and the papers @searchivarius shared, I think it's probably okay not to add this transform for now.
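For context on why the transform preserves the dot-product ordering: appending `sqrt(max_norm^2 - ||d||^2)` to each document and a `0` to each query leaves every dot product unchanged while giving all augmented documents the same norm, so max-inner-product search on the original data becomes angular/cosine search on the augmented data. Here is a minimal sanity-check sketch of that identity on random data (variable names are mine, not from the script below):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64)).astype(np.float32)
queries = rng.normal(size=(10, 64)).astype(np.float32)

norms = np.linalg.norm(docs, axis=1)
max_norm = norms.max()

# Augment docs with sqrt(max_norm^2 - ||d||^2) and queries with 0.
docs_aug = np.concatenate([docs, np.sqrt(max_norm**2 - norms**2)[:, None]], axis=1)
queries_aug = np.concatenate(
    [queries, np.zeros((len(queries), 1), dtype=np.float32)], axis=1
)

# Dot products are preserved (up to float error)...
assert np.allclose(queries_aug @ docs_aug.T, queries @ docs.T, atol=1e-3)
# ...and all augmented docs share the same norm, so max-inner-product
# order on the original data equals cosine order on the augmented data.
assert np.allclose(np.linalg.norm(docs_aug, axis=1), max_norm, atol=1e-3)
```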
<details>
<summary><h2>Here is the script I used for transforming the data set</h2></summary>

```python
import numpy as np
import pyarrow.parquet as pq


def transform_queries(Q):
    n, _ = Q.shape
    # Append a zero coordinate so the augmented dot product is unchanged.
    return np.concatenate([Q, np.zeros((n, 1))], axis=-1, dtype=np.float32)


def transform_docs(D, norms):
    n, d = D.shape
    max_norm = norms.max()
    flipped_norms = np.copy(norms).reshape(n, 1)
    # Append sqrt(max_norm^2 - ||d||^2) so every augmented doc has norm max_norm.
    transformed_data = np.concatenate(
        [D, np.sqrt(max_norm**2 - flipped_norms**2)], axis=-1, dtype=np.float32
    )
    return transformed_data


def validate_array_match_upto_dim(arr1, arr2, dim_eq_upto):
    assert np.allclose(arr1[:dim_eq_upto], arr2[:dim_eq_upto]), "data sets are different"


def validate_dataset_match_upto_dim(arr1, arr2, dim_eq_upto):
    n1, d1 = arr1.shape
    n2, d2 = arr2.shape
    assert n1 == n2, "Shape does not map"
    for i in range(n1):
        validate_array_match_upto_dim(arr1[i], arr2[i], dim_eq_upto)


tb1 = pq.read_table("train-00000-of-00004-1a1932c9ca1c7152.parquet", columns=['emb'])
tb2 = pq.read_table("train-00001-of-00004-f4a4f5540ade14b4.parquet", columns=['emb'])
tb3 = pq.read_table("train-00002-of-00004-ff770df3ab420d14.parquet", columns=['emb'])
tb4 = pq.read_table("train-00003-of-00004-85b3dbbc960e92ec.parquet", columns=['emb'])

np1 = tb1[0].to_numpy()
np2 = tb2[0].to_numpy()
np3 = tb3[0].to_numpy()
np4 = tb4[0].to_numpy()

np_total = np.concatenate((np1, np2, np3, np4))

# Have to convert to a list here to get
# the numpy ndarray's shape correct later
# There's probably a better way...
flat_ds = list()
for vec in np_total:
    flat_ds.append(vec)

# Shape is (485859, 768) and dtype is float32
np_flat_ds = np.array(flat_ds)

transformed_queries = transform_queries(np_flat_ds[475858:-1])
validate_dataset_match_upto_dim(transformed_queries, np_flat_ds[475858:-1], 768)
with open("wiki768.test", "wb") as out_f:
    transformed_queries.tofile(out_f)

magnitudes = np.linalg.norm(np_flat_ds[0:400000], axis=1)
indices = np.argsort(magnitudes)
transformed_np_flat_ds = transform_docs(np_flat_ds[0:400000], magnitudes)
validate_dataset_match_upto_dim(transformed_np_flat_ds, np_flat_ds[0:400000], 768)
transformed_np_flat_ds_sorted = transformed_np_flat_ds[indices]

with open("wiki768.random.train", "wb") as out_f:
    transformed_np_flat_ds.tofile(out_f)
with open("wiki768.ordered.train", "wb") as out_f:
    transformed_np_flat_ds_sorted.tofile(out_f)
with open("wiki768.reversed.train", "wb") as out_f:
    # Flip along the document axis only; a bare np.flip would also reverse
    # the coordinates within each vector.
    np.flip(transformed_np_flat_ds_sorted, axis=0).tofile(out_f)
```

</details>
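In case anyone wants to load the generated files back: the script writes raw float32 vectors with `tofile` (no header), so something like the sketch below should round-trip them. The 769 here is an assumption on my part (768 original dims plus the one augmented dim):

```python
import numpy as np

DIM = 769  # assumed: 768 original dims + 1 augmented dim

docs = np.fromfile("wiki768.random.train", dtype=np.float32).reshape(-1, DIM)
queries = np.fromfile("wiki768.test", dtype=np.float32).reshape(-1, DIM)
print(docs.shape, queries.shape)
```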