benwtrent commented on PR #13586: URL: https://github.com/apache/lucene/pull/13586#issuecomment-2248470540
@jpountz I built an index with ~1M CohereV3 floating point vectors (this requires about 4GB of RAM), force-merged it into a single segment, and benchmarked on an `e2-medium` (4GB of RAM) with 1GB set aside for the heap. To download the vectors, I used a script like the following:

```bash
#!/bin/bash
# Download the 11 English parquet shards of the CohereV3 Wikipedia embeddings in parallel.
base_url="https://huggingface.co/api/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3/parquet/en/train/"
for i in {0..10}
do
  url="${base_url}${i}.parquet"
  output_file="${i}-en.parquet"
  echo "Downloading: $url"
  curl -L "$url" -o "$output_file" &
done
wait
```

Then, to concatenate the shards into flat train/test files that luceneutil can read:

```python
import pyarrow.parquet as pq
import numpy as np

name = "wiki1024en"
# Read just the embedding column from each downloaded shard.
tbs = [pq.read_table(f"{x}-en.parquet", columns=['emb']) for x in range(11)]
nps = [tb[0].to_numpy() for tb in tbs]
np_total = np.concatenate(nps)

# Each element is a per-row embedding array; stack them into one 2D matrix.
flat_ds = list()
for vec in np_total:
    flat_ds.append(vec)
np_flat_ds = np.array(flat_ds)
print(f"{np_flat_ds.shape}")

row_count = np_flat_ds.shape[0]
dims = np_flat_ds.shape[1]
# Hold out the last 1,000 rows as query vectors; the rest are indexed.
query_count = 1_000
training_rows = row_count - query_count
print(f"{name} num rows: {training_rows}")
with open(f"{name}.test", "wb") as out_f:
    np_flat_ds[training_rows:].tofile(out_f)
with open(f"{name}.train", "wb") as out_f:
    np_flat_ds[0:training_rows].tofile(out_f)
```
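If it's useful, here's a minimal sanity check of the output files before feeding them to luceneutil. It assumes the parquet `emb` column is float32 (so `tofile` wrote raw little-endian float32) and that the vectors are 1024-dimensional, per the CohereV3 model and the `wiki1024en` name:

```python
import numpy as np

# Assumed: raw float32 vectors, 1024 dims (CohereV3 / "wiki1024en").
dims = 1024
train = np.fromfile("wiki1024en.train", dtype=np.float32).reshape(-1, dims)
test = np.fromfile("wiki1024en.test", dtype=np.float32).reshape(-1, dims)
print(f"train: {train.shape}, test: {test.shape}")  # expect roughly (1M, 1024) and (1000, 1024)
```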