LantaoJin commented on PR #16214:
URL: https://github.com/apache/lucene/pull/16214#issuecomment-4666315741

   > in that situation you typically need to reindex everything
   
   Thanks @jimczi, @iverase — this is the right thing to pressure-test. Let me 
give the concrete shape of the use case, because I think it narrows the 
disagreement.
   
   **The index**: one vector field + ~49 other fields (analyzed text, stored, 
doc-values). The 49 fields are **immutable** for the lifetime of the index. 
Only the embedding changes: we fine-tune **model1 → model2** (same dimension, 
same similarity), and want to refresh the vector field for the affected docs.
   
   You're right that the **vector field itself is rebuilt** regardless -- every 
re-embedded doc changes its vector, so the flat column is fully rewritten and 
the HNSW graph is fully rebuilt no matter which path we take. The in-place path 
doesn't claim to save any vector-side work, and I agree that "you reindex the 
vector field anyway" is true.
   
   The cost it removes is the **other 49 fields**. Today the only way to 
refresh the vectors is to build a new index with all 50 fields and drop the old 
one -- which re-analyzes and rewrites the 49 unchanged fields for zero benefit, 
and requires the caller to still **possess every field's source value** (often 
it lives in a separate source-of-truth system and isn't cheaply 
reconstructable). `updateDocument` has the same problem: it's whole-document, 
so it drags the 49 fields along. In our indices the 49 fields are the bulk of 
the indexing cost, so skipping them is the actual win -- not the vector write.
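
   To make the asymmetry concrete, here is a minimal toy model in plain Java. It is not Lucene code: `ToyIndex`, `replaceDocument`, `updateVector`, and the field names are hypothetical, and it only counts how many field values each path has to write for a 50-field document.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory model (not Lucene): counts field values written by each update path. */
public class ToyIndex {
    public static final String VECTOR_FIELD = "embedding"; // hypothetical field name

    private final Map<String, Map<String, Object>> docs = new HashMap<>();
    public long fieldWrites = 0;

    /** Whole-document replacement: the caller must supply every field, and every field is rewritten. */
    public void replaceDocument(String id, Map<String, Object> allFields) {
        docs.put(id, new HashMap<>(allFields));
        fieldWrites += allFields.size();
    }

    /** Vector-only update: the caller supplies just (id, newVector); the other fields are untouched. */
    public void updateVector(String id, float[] newVector) {
        Map<String, Object> doc = docs.get(id);
        if (doc == null) {
            throw new IllegalArgumentException("unknown doc: " + id);
        }
        doc.put(VECTOR_FIELD, newVector);
        fieldWrites += 1;
    }

    /** Builds a document with 49 placeholder fields plus one vector field. */
    public static Map<String, Object> sampleDoc(float[] vector) {
        Map<String, Object> fields = new HashMap<>();
        for (int i = 0; i < 49; i++) {
            fields.put("field" + i, "value" + i);
        }
        fields.put(VECTOR_FIELD, vector);
        return fields;
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        ToyIndex index = new ToyIndex();
        for (int i = 0; i < numDocs; i++) {
            index.replaceDocument("doc" + i, sampleDoc(new float[] {0f, 1f}));
        }

        // Path 1: refresh vectors by replacing whole documents.
        index.fieldWrites = 0; // ignore the initial load
        for (int i = 0; i < numDocs; i++) {
            index.replaceDocument("doc" + i, sampleDoc(new float[] {1f, 0f}));
        }
        long wholeDocWrites = index.fieldWrites;

        // Path 2: refresh vectors given only (id, newVector).
        index.fieldWrites = 0;
        for (int i = 0; i < numDocs; i++) {
            index.updateVector("doc" + i, new float[] {1f, 0f});
        }
        long vectorOnlyWrites = index.fieldWrites;

        System.out.println("whole-document field writes: " + wholeDocWrites);
        System.out.println("vector-only field writes:    " + vectorOnlyWrites);
    }
}
```

   The model ignores analysis cost, segment merging, and the HNSW rebuild, which both paths pay. It only shows that the whole-document path needs all 50 field values in hand and writes all of them, while the vector-only path needs one.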
   
   So I'd reframe the value proposition as: update the embedding given only 
`(id, newVector)`, without rebuilding or even possessing the rest of the 
document. A model-version bump is one instance; partial or rolling 
re-embedding of a subset of docs is another.
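
   As a rough sketch of what the caller's side could look like for a rolling re-embed, the plain-Java snippet below applies a batch of `(id, newVector)` pairs to a subset of docs. It is not a Lucene API: `RollingReembed.apply` and the map-backed store are hypothetical, and the dimension check stands in for the "same dimension" constraint mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy sketch (not Lucene): applies a partial batch of (id, newVector) updates. */
public class RollingReembed {

    /**
     * Overwrites the vectors for the ids in the batch. Ids absent from the batch
     * keep their old vector. Returns the number of vectors updated.
     */
    public static int apply(Map<String, float[]> vectorsById,
                            Map<String, float[]> batch,
                            int dimension) {
        // Validate the whole batch first so a bad entry cannot leave it half-applied.
        for (Map.Entry<String, float[]> e : batch.entrySet()) {
            if (!vectorsById.containsKey(e.getKey())) {
                throw new IllegalArgumentException("unknown id: " + e.getKey());
            }
            if (e.getValue().length != dimension) {
                throw new IllegalArgumentException(
                    "dimension mismatch for " + e.getKey() + ": expected " + dimension
                        + ", got " + e.getValue().length);
            }
        }
        for (Map.Entry<String, float[]> e : batch.entrySet()) {
            vectorsById.put(e.getKey(), e.getValue().clone());
        }
        return batch.size();
    }

    public static void main(String[] args) {
        Map<String, float[]> vectors = new LinkedHashMap<>();
        vectors.put("doc1", new float[] {0.1f, 0.2f});
        vectors.put("doc2", new float[] {0.3f, 0.4f});
        vectors.put("doc3", new float[] {0.5f, 0.6f});

        // Only doc2 has been re-embedded so far; doc1 and doc3 are left alone.
        Map<String, float[]> batch = Map.of("doc2", new float[] {0.9f, 0.8f});
        int updated = apply(vectors, batch, 2);

        System.out.println(updated + " of " + vectors.size() + " vectors refreshed");
    }
}
```

   Note that the caller never touches, or needs to hold, any non-vector field value.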


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

