msokolov commented on PR #926: URL: https://github.com/apache/lucene/pull/926#issuecomment-1164418508
Hi Alessandro, thank you for running the tests. I'm suspicious of the results, though; they just look too good to be true! I know from profiling that we spend most of the time in similarity computations, yet this change affects neither how many of those we do nor how costly they are.

One thing I see is that you are using an `hdf5` file as input, but this tester was not designed to accept that format. Below is a script I have used to extract raw floating-point data (what KnnGraphTester expects) from hdf5. It also normalizes the vectors to unit length, which you should do for angular data, but not for Euclidean.

```
import h5py
import numpy as np
import sys

with h5py.File(sys.argv[1], 'r') as f:
    for key in f.keys():
        print(f"{key}: {f[key].shape}")
        ds = f[key]
        print(f"copying {ds.shape} from {key}")
        # read the dataset into a preallocated float32 array
        arr = np.zeros(ds.shape, dtype='float32')
        ds.read_direct(arr)
        # normalize all vectors (along dim 1) to unit length
        norm = np.linalg.norm(arr, 2, 1)
        # leave all-zero vectors unchanged instead of dividing by zero
        norm[norm==0] = 1
        arr = arr / np.expand_dims(norm, 1)
        # write raw float32 values to <input file>-<dataset name>
        arr.tofile(sys.argv[1] + "-" + key)
```
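To sanity-check a file produced by the script above before feeding it to the tester, something like the following sketch could help. It is not part of the Lucene tooling. The function names `read_raw_vectors` and `max_norm_deviation` are made up for illustration, and the vector dimension has to be supplied by the caller because the raw file has no header. It assumes the file holds 4-byte floats in the machine's native byte order, which is what `tofile` writes.

```python
import math
import os
from array import array


def read_raw_vectors(path, dim):
    """Read a headerless float32 file and split it into rows of length dim.

    Hypothetical helper: `dim` must match the dimension of the exported data.
    """
    size = os.path.getsize(path)
    if size % (4 * dim) != 0:
        raise ValueError(f"{path}: {size} bytes is not a multiple of {4 * dim}")
    values = array("f")  # 'f' is a 4-byte float in native byte order on common platforms
    with open(path, "rb") as fh:
        values.fromfile(fh, size // 4)
    return [values[i:i + dim] for i in range(0, len(values), dim)]


def max_norm_deviation(vectors):
    """Return the largest |norm - 1| over all non-zero vectors."""
    worst = 0.0
    for v in vectors:
        n = math.sqrt(sum(x * x for x in v))
        if n > 0:  # all-zero rows are left as-is by the extraction script
            worst = max(worst, abs(n - 1.0))
    return worst
```

For an angular dataset, `max_norm_deviation(read_raw_vectors(path, dim))` should come out very close to zero; a large value would suggest the wrong dimension, the wrong dataset key, or data that was never normalized.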