benwtrent commented on PR #14178:
URL: https://github.com/apache/lucene/pull/14178#issuecomment-2622749459

   @kaivalnp the force-merge time indicates that during merge to a single 
segment, the index is being rebuilt from various segments. I would think that 
the `force-merge` time itself is more indicative of the cost of indexing than 
just the initial index phase.
   
   For this benchmark there are a couple of options:
   
    - Increase the KnnIndexer buffer size to allow 2x the size of the float 
vectors in memory (thus keeping a flush from occurring until the graph is ready 
to be built) and remove the "force-merge" option completely. This will also 
allow a single segment to be created. Just make sure you have enough heap 
allocated.
    - Simply sum the two numbers together.
    - Don't force-merge at all and just accept multiple segments.
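
   As a rough illustration of the first two options, here is a minimal sketch of 
the arithmetic involved. The function names and the example numbers (1M 
vectors, 768 dimensions, 120 s and 300 s timings) are hypothetical and are not 
taken from this benchmark; the sketch only assumes 4-byte float components and 
the 2x headroom mentioned above.

```python
import math

BYTES_PER_FLOAT = 4  # assumption: float32 vector components


def buffer_size_mb(num_docs: int, dims: int, headroom: float = 2.0) -> int:
    """Estimate a buffer size (in MB) that holds `headroom` times the raw vectors."""
    raw_bytes = num_docs * dims * BYTES_PER_FLOAT
    return math.ceil(headroom * raw_bytes / (1024 * 1024))


def total_indexing_seconds(index_seconds: float, force_merge_seconds: float) -> float:
    """Option 2: treat initial indexing plus force-merge as the full indexing cost."""
    return index_seconds + force_merge_seconds


if __name__ == "__main__":
    # Hypothetical corpus: 1M vectors with 768 dimensions.
    print(buffer_size_mb(1_000_000, 768), "MB buffer")
    # Hypothetical timings: 120 s initial indexing, 300 s force-merge.
    print(total_indexing_seconds(120.0, 300.0), "s total")
```

   The heap has to be sized above whatever buffer value comes out of this 
estimate, since the graph itself also needs memory while it is being built.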
   
   
   One other concern: there are two types of "multi-threading" in the indexing. 
There is the number of threads doing the indexing (e.g. writing to an indexer 
and creating a segment) and the number of threads used when building a graph. 
For simplicity, I would reduce the number of indexing threads to 1, faiss 
threads to 1, and merge workers to 1. Once we have numbers for the cost of 
running on a single thread, we can see how adjusting these allows one to 
pull ahead of the other.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

