benwtrent commented on issue #16263: URL: https://github.com/apache/lucene/issues/16263#issuecomment-4718212138
> The underlying reason appears to be that a short query or query concept has a much higher probability of matching a short segment well than a longer segment containing the same information.

This is due to the model, not the index. The index structure itself doesn't add any bias, and it should not. The embedding model should encode all important information and add whatever bias it deems necessary.

That said, the great thing about embeddings is that they are just numbers. If you want to add a bias directly yourself, multiply the vector components by the document length (or the square root of the length) and use maximum inner product. The magnitude then encodes the document and query length.

As with anything, measure the results. The magnitude may end up becoming the dominating feature for matching, so you may want to normalize the magnitude multiplier by some known corpus statistics.
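A rough sketch of the scaling idea, in plain Python so it stays independent of any index API. The names `scale_by_length` and `inner_product`, the `exponent` parameter, and the choice of mean corpus length as the normalizing statistic are all assumptions made for illustration; none of them come from Lucene. In Lucene, the scaled vectors would presumably be indexed with `VectorSimilarityFunction.MAXIMUM_INNER_PRODUCT`, which is not shown here.

```python
def scale_by_length(vector, doc_length, mean_length, exponent=0.5):
    """Return a copy of `vector` whose magnitude encodes relative length.

    Assumption: the multiplier is (doc_length / mean_length) ** exponent,
    so exponent=0.5 gives sqrt scaling and exponent=1.0 gives linear scaling.
    Dividing by the mean corpus length is one possible normalization, meant
    to keep the multiplier near 1 for a typical document.
    """
    if doc_length <= 0 or mean_length <= 0:
        raise ValueError("lengths must be positive")
    factor = (doc_length / mean_length) ** exponent
    return [x * factor for x in vector]


def inner_product(a, b):
    """Plain dot product, the score used by maximum inner product search."""
    return sum(x * y for x, y in zip(a, b))


# Two segments with the same direction in embedding space but different lengths.
query = [0.6, 0.8]
short_segment = scale_by_length([0.6, 0.8], doc_length=50, mean_length=200)
long_segment = scale_by_length([0.6, 0.8], doc_length=800, mean_length=200)

short_score = inner_product(query, short_segment)
long_score = inner_product(query, long_segment)
```

With this scaling the longer segment scores higher than the shorter one for the same direction, which is the bias being added. Whether sqrt, linear, or some clipped multiplier works best is an empirical question, so compare retrieval quality before and after on a real query set.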
