Re: [I] Bias Towards Short Text Segments in Vector Search Results [lucene]

via GitHub Tue, 16 Jun 2026 04:25:22 -0700


benwtrent commented on issue #16263:
URL: https://github.com/apache/lucene/issues/16263#issuecomment-4718212138


   > The underlying reason appears to be that a short query or query concept 
has a much higher probability of matching a short segment well than a longer 
segment containing the same information.
   
   This is due to the model, not the index.
   
   The index structure itself doesn't add any bias, and it should not. The 
embedding model should encode all important information and add whatever bias 
it deems necessary.
   
   That said.
   
   The great thing about embeddings is that they are just numbers. So, if you 
want to add a bias directly yourself, multiply the vector components by the 
document length (or the sqrt of the length) and use maximum inner product. Then 
the magnitude encodes the document and query length. As with anything, measure 
the results. The magnitude may end up becoming the dominating feature for 
matching...so...maybe normalize the magnitude multiplier by some known corpus 
statistics.
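
   A minimal sketch of that idea in plain Python, not Lucene code. The helper 
names (`unit`, `length_scaled`, `rank`), the toy vectors, the token counts, and 
the `MEAN_LENGTH` value are all made up for illustration. It assumes the model 
output is first normalized to unit length, then scaled by 
(length / mean length) raised to an exponent, and scored by inner product. In 
Lucene the scaled vectors would presumably be indexed with 
`VectorSimilarityFunction.MAXIMUM_INNER_PRODUCT`, since `DOT_PRODUCT` expects 
unit-length vectors.

```python
import math


def unit(vec):
    """Return vec scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def length_scaled(vec, length, mean_length, exponent=0.5):
    """Scale a unit vector by (length / mean_length) ** exponent.

    exponent=0.5 is the sqrt variant; exponent=1.0 uses the raw length ratio.
    Dividing by mean_length (a corpus statistic) keeps magnitudes near 1.
    """
    factor = (length / mean_length) ** exponent
    return [x * factor for x in unit(vec)]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank(query, docs):
    """Sort (doc_id, vector) pairs by inner product with query, best first."""
    scored = [(doc_id, dot(query, vec)) for doc_id, vec in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: doc_id -> (embedding, token count).
corpus = {
    "short": ([0.9, 0.1], 20),
    "long": ([0.8, 0.6], 320),
}
MEAN_LENGTH = 80  # assumed corpus statistic, chosen only for illustration
query = unit([1.0, 0.0])

# Baseline: unit vectors, so inner product equals cosine similarity.
plain = [(doc_id, unit(vec)) for doc_id, (vec, _) in corpus.items()]
# Length-biased: magnitude now carries the (sqrt of the) relative length.
scaled = [
    (doc_id, length_scaled(vec, n, MEAN_LENGTH))
    for doc_id, (vec, n) in corpus.items()
]

plain_ranking = rank(query, plain)
scaled_ranking = rank(query, scaled)
print(plain_ranking)
print(scaled_ranking)
```

   On this toy data the short document ranks first under plain cosine and the 
long one ranks first after scaling, which also shows how easily the magnitude 
can take over. Lowering `exponent` or changing the normalizing statistic are 
the knobs to measure against real relevance judgments. Scaling the query by a 
constant does not change the ordering, so only the document side matters for 
ranking here.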


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


