Hello,

Thanks for your detailed explanation.

> Do you want to punish *more* long documents?
Not a lot, but a bit more than the default implementation. It seems
"lengthNorm" is field-based, and punishing lengthy fields does fit most of
the cases in our project.

> There will be a trade-off since there are lots of parameters.
> If you have two-words query which one is important for you:
> A short document containing one word?
> A long document containing two word?
> Or
> A long document containing one query term which is very rare (high idf)
> A short document containing one query term which is very common (low idf)
>
> Many combinations...

Now that I am aware of all these combinations, it does not seem wise to
proceed blindly by punishing whatever we want. Thank you very much for
letting me know.

I have observed the scoring by enabling debugging. There is a clear
explanation of the score as a product of x, y, z, ... in which I can find
tf, idf and fieldNorm(), but not coord(q,d) and queryNorm(q). It is
mentioned that queryNorm is the same for all documents, so no worries
there, but what about coord(q,d)?

Thanks.

On 16 February 2010 18:00, Ahmet Arslan <iori...@yahoo.com> wrote:
>
>> Hello ,
>> Thanks. That clears my doubts. Coming to the point two, Can you please
>> tell me which part of the Similarity takes care of the same. Is it
>> possible to implement in such a way that we give more preference to
>> "number of found terms".
>
> public float coord(int overlap, int maxOverlap) method takes care:
>
> "coord(q,d) is a score factor based on how many of the query terms are found
> in the specified document. Typically, a document that contains more of the
> query's terms will receive a higher score than another document with fewer
> query terms. This is a search time factor computed in coord(q,d) by the
> Similarity in effect at search time."
>
>> Also, here in our case we need to give more importance to
>> "length normalisation" than the default?
>
> Do you want to punish *more* long documents?
> For example you can return directly 1/numTerms or 1/(numTerms*numTerms) in
> this method of DefaultSimilarity:
>
>   /** Implemented as <code>1/sqrt(numTerms)</code>. */
>   @Override
>   public float lengthNorm(String fieldName, int numTerms) {
>     return (float)(1.0 / Math.sqrt(numTerms));
>   }
>
> There will be a trade-off since there are lots of parameters.
> If you have two-words query which one is important for you:
> A short document containing one word?
> A long document containing two word?
> Or
> A long document containing one query term which is very rare (high idf)
> A short document containing one query term which is very common (low idf)
>
> Many combinations...
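
For what it is worth, the trade-off discussed above can be sketched in plain Java without touching Lucene at all. The class below is only an illustration: `NormSketch`, `partialScore` and the document lengths (4 and 100 terms) are made-up names and numbers, the coord formula is assumed to be overlap/maxOverlap, and tf, idf and queryNorm are deliberately left out. It compares the quoted 1/sqrt(numTerms) with the suggested 1/numTerms for a two-word query.

```java
/** Standalone illustration; it does not use or extend any Lucene class. */
final class NormSketch {

    /** 1/sqrt(numTerms), the formula from the quoted DefaultSimilarity code. */
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** 1/numTerms, the stronger penalty suggested in the thread. */
    static float linearLengthNorm(int numTerms) {
        return 1.0f / numTerms;
    }

    /** Assumed coord: fraction of the query terms found in the document. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /**
     * Hypothetical helper: coord * lengthNorm only.
     * tf, idf and queryNorm are ignored to keep the comparison small.
     */
    static float partialScore(int overlap, int maxOverlap, int numTerms, boolean linear) {
        float norm = linear ? linearLengthNorm(numTerms) : defaultLengthNorm(numTerms);
        return coord(overlap, maxOverlap) * norm;
    }

    public static void main(String[] args) {
        // Two-word query: short field (4 terms) matching one word
        // versus long field (100 terms) matching both words.
        for (boolean linear : new boolean[] {false, true}) {
            float shortDoc = partialScore(1, 2, 4, linear);
            float longDoc = partialScore(2, 2, 100, linear);
            System.out.println((linear ? "1/numTerms      " : "1/sqrt(numTerms)")
                    + "  short=" + shortDoc + "  long=" + longDoc
                    + "  ratio=" + (shortDoc / longDoc));
        }
    }
}
```

With these made-up lengths the short document wins under both formulas, but the gap between the two documents grows a lot with 1/numTerms, which is the kind of side effect to check before changing lengthNorm in a real Similarity.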