benwtrent commented on issue #12342:
URL: https://github.com/apache/lucene/issues/12342#issuecomment-1649997886

   OK, I reran my experiments. I ran two: one with `reverse` non-transformed data (so the dimension within knnPerf is 768) and one with `reverse` transformed data (769 dimensions).
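   
   For context, the extra 769th dimension comes from the max-inner-product transform under discussion. Here is a minimal sketch of the corpus-side transform as I understand it (the scale-by-max-norm convention here is my assumption, not necessarily the exact code used elsewhere in this thread):
   
   ```python
   import numpy as np
   
   def mip_transform(corpus: np.ndarray) -> np.ndarray:
       """Append one dimension so that cosine similarity on the result
       ranks like inner product on the originals. Sketch only; the
       exact scaling convention is an assumption."""
       scaled = corpus / np.linalg.norm(corpus, axis=1).max()  # all norms <= 1
       extra = np.sqrt(np.clip(1.0 - (scaled * scaled).sum(axis=1), 0.0, None))
       return np.hstack([scaled, extra[:, None]]).astype(np.float32)
   
   # Query vectors just get a zero appended: q' = [q, 0].
   ```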
   
   ### Reversed, not transformed (768 dims)
   ```
   recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms  selectivity  filterType
   0.145   0.38     400000  0       48       200        10       0         1.00         post-filter
   0.553   2.05     400000  90      48       200        100      0         1.00         post-filter
   0.709   3.66     400000  190     48       200        200      0         1.00         post-filter
   0.878   8.05     400000  490     48       200        500      0         1.00         post-filter
   ```
   
   ### Reversed, transformed (769 dims)
   ```
   recall  latency  nDoc    fanout  maxConn  beamWidth  visited  index ms  selectivity  filterType
   0.211   0.49     400000  0       48       200        10       0         1.00         post-filter
   0.691   2.80     400000  90      48       200        100      0         1.00         post-filter
   0.814   5.14     400000  190     48       200        200      0         1.00         post-filter
   0.926   11.31    400000  490     48       200        500      0         1.00         post-filter
   ```
   
   Recall seems improved for me, but latency increases with the transformed data. I bet part of this is the overhead of dealing with CPU execution lanes in Panama, since 769 is no longer a "nice" number of dimensions.
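   
   To make that lane overhead concrete, here is a back-of-the-envelope sketch (the 8-lane width is an assumption for a 256-bit float32 species; the actual width depends on the CPU and the Panama species chosen):
   
   ```python
   # Vectorized steps vs. scalar tail per dot product.
   LANES = 8  # assumed float32 lanes per SIMD step (256-bit species)
   for dims in (768, 769):
       full_steps, tail = divmod(dims, LANES)
       print(f"{dims} dims: {full_steps} full SIMD steps + {tail} scalar tail element(s)")
   # 768 divides evenly; 769 leaves a scalar tail on every distance computation.
   ```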
   
   So, my `transformed` numbers match @jmazanec15's results exactly. However, I am seeing an extreme discrepancy in my non-transformed numbers.
   
   @jmazanec15, here is the code I used to generate my "reverse" non-transformed data. Could you double-check that your `descending` case data does the same?
   
   There is something significant here that we are missing.
   
   ```python
   import numpy as np
   import pyarrow.parquet as pq
   
   tb1 = pq.read_table("train-00000-of-00004-1a1932c9ca1c7152.parquet", columns=['emb'])
   tb2 = pq.read_table("train-00001-of-00004-f4a4f5540ade14b4.parquet", columns=['emb'])
   tb3 = pq.read_table("train-00002-of-00004-ff770df3ab420d14.parquet", columns=['emb'])
   tb4 = pq.read_table("train-00003-of-00004-85b3dbbc960e92ec.parquet", columns=['emb'])
   
   np1 = tb1[0].to_numpy()
   np2 = tb2[0].to_numpy()
   np3 = tb3[0].to_numpy()
   np4 = tb4[0].to_numpy()
   
   np_total = np.concatenate((np1, np2, np3, np4))
   
   
   # Have to convert to a list here to get the numpy ndarray's shape
   # correct later (the parquet column comes back as an object array
   # of per-row arrays). There's probably a better way...
   flat_ds = list()
   for vec in np_total:
       flat_ds.append(vec)
   
   # Shape is (485859, 768) and dtype is float32
   np_flat_ds = np.array(flat_ds)
   
   # Sort the first 400k vectors by ascending magnitude, then reverse
   # the row order so the file is in descending-magnitude order.
   subset = np_flat_ds[0:400000]
   magnitudes = np.linalg.norm(subset, axis=1)
   indices = np.argsort(magnitudes)
   np_flat_ds_sorted = subset[indices]
   
   # axis=0 reverses row order only; np.flip with no axis would also
   # reverse the elements inside each vector. Open in binary mode,
   # since tofile() writes raw bytes.
   with open("wiki768.reversed.train", "wb") as out_f:
       np.flip(np_flat_ds_sorted, axis=0).tofile(out_f)
   
   ```
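   
   If it helps with the double-checking, here is a quick sanity check (my addition, assuming the float32 layout written above) that reads the file back and confirms the magnitudes are descending:
   
   ```python
   import numpy as np
   
   vectors = np.fromfile("wiki768.reversed.train", dtype=np.float32).reshape(-1, 768)
   norms = np.linalg.norm(vectors, axis=1)
   assert np.all(np.diff(norms) <= 0), "norms are not in descending order"
   print(f"{len(vectors)} vectors; norms go {norms[0]:.3f} -> {norms[-1]:.3f}")
   ```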

