benwtrent commented on PR #13586:
URL: https://github.com/apache/lucene/pull/13586#issuecomment-2248470540

   @jpountz I built an index with ~1M CohereV3 floating-point vectors (this 
requires ~4GB of RAM), force-merged it into a single segment, and 
benchmarked on `e2-medium` (4GB of RAM) with 1GB set aside for the heap. 
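
As a rough check on the ~4GB figure, here is a minimal sketch of the arithmetic. It assumes 1024-dimensional `float32` vectors (the dimension is inferred from the `wiki1024en` name used further down, not stated outright) and counts only the raw vector data, not the HNSW graph or other index structures. The helper name `raw_vector_bytes` is made up for illustration.

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_value=4):
    """Size in bytes of the raw vector data alone (no graph, no metadata)."""
    return num_vectors * dims * bytes_per_value


# Assumed values: ~1M vectors, 1024 dims, 4-byte float32 components.
size = raw_vector_bytes(1_000_000, 1024)
print(f"{size:,} bytes = {size / 2**30:.2f} GiB")
```

With these assumed values the raw vectors alone come to about 4.1 GB, which is why they cannot be fully cached on a 4GB machine that also reserves 1GB for the heap.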
   
   To download the vectors, I used a script like the following:
   ```bash
   #!/bin/bash
   # Brace expansion ({0..10}) is a bash feature, so this must run under bash, not plain sh.
   
   base_url="https://huggingface.co/api/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3/parquet/en/train/"
   
   for i in {0..10}
   do
       url="${base_url}${i}.parquet"
       output_file="${i}-en.parquet"
       echo "Downloading: $url"
       curl -L "$url" -o "$output_file" &
   done
   wait
   ```
   
   Then, to concatenate them into flat binary files (`.train` and `.test`) that can be used by Lucene Util:
   
   ```python
   import pyarrow.parquet as pq
   import numpy as np
   
   name = "wiki1024en"
   tbs = [pq.read_table(f"{x}-en.parquet", columns=['emb']) for x in range(11)]
   nps = [tb[0].to_numpy() for tb in tbs]
   np_total = np.concatenate(nps)
   
   # Each entry of np_total is itself a 1-D array; stack them into one 2-D (rows, dims) array.
   np_flat_ds = np.stack(np_total)
   print(f"{np_flat_ds.shape}")
   row_count = np_flat_ds.shape[0]
   dims = np_flat_ds.shape[1]
   query_count = 1_000
   training_rows = row_count - query_count
   print(f"{name} num rows: {training_rows}")
   
   # The last `query_count` rows become the query set (a `:-1` slice would drop the final row).
   with open(f"{name}.test", "wb") as out_f:
       np_flat_ds[training_rows:].tofile(out_f)
   
   with open(f"{name}.train", "wb") as out_f:
       np_flat_ds[0:training_rows].tofile(out_f)
   ```
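
A quick sanity check on the output files can be sketched as follows. `ndarray.tofile` writes raw values with no header, so the number of vectors in a file is its size divided by the bytes per vector. The helper name `vector_count` is hypothetical, and the 1024 dimensions and 4-byte `float32` components are assumptions; substitute the `dims` value printed by the script above.

```python
import os


def vector_count(path, dims, bytes_per_value=4):
    """Return the number of vectors in a raw, headerless vector file."""
    size = os.path.getsize(path)
    row_bytes = dims * bytes_per_value
    if size % row_bytes != 0:
        raise ValueError(
            f"{path}: {size} bytes is not a multiple of {row_bytes} bytes per vector"
        )
    return size // row_bytes


# Assumed file names and dimension; files that do not exist are skipped.
for path in ("wiki1024en.train", "wiki1024en.test"):
    if os.path.exists(path):
        print(path, vector_count(path, dims=1024))
```

The `.test` file should report exactly `query_count` vectors, and the two counts together should equal the total row count printed by the script.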


