Hello,
Thanks for your detailed explanation.

> Do you want to punish *more* long documents?

Not a lot, but a bit more than the default implementation. It seems
"lengthNorm" is field-based, and punishing lengthy fields fits most
of the cases in our project.
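
To make "a bit more than default" concrete, here is a minimal plain-Java sketch of the idea. It does not use Lucene itself: the class name LengthNormSketch and the exponent parameter are only my assumptions for illustration. An exponent of 0.5 reproduces the default 1/sqrt(numTerms), 1.0 gives the 1/numTerms you suggested, and something like 0.75 sits in between.

```java
// Illustration only: mirrors the lengthNorm formula outside of Lucene.
// LengthNormSketch and the "exponent" parameter are hypothetical names.
final class LengthNormSketch {

    // Returns 1 / numTerms^exponent.
    // exponent = 0.5 matches the default 1/sqrt(numTerms);
    // larger exponents punish long fields more strongly.
    static float lengthNorm(int numTerms, double exponent) {
        return (float) (1.0 / Math.pow(numTerms, exponent));
    }

    public static void main(String[] args) {
        // Compare the default curve with a slightly steeper one.
        for (int n : new int[] {10, 100, 1000}) {
            System.out.printf("numTerms=%d default=%.4f steeper=%.4f%n",
                    n, lengthNorm(n, 0.5), lengthNorm(n, 0.75));
        }
    }
}
```

In a real setup the same expression would go into an overridden lengthNorm(String fieldName, int numTerms) of a DefaultSimilarity subclass. Since the norm is computed at index time, I assume we would have to reindex after changing it.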

> There will be a trade-off since there are lots of parameters.
> If you have a two-word query, which one is more important for you:
> A short document containing one word?
> A long document containing two words?
> Or
> A long document containing one query term which is very rare (high idf)
> A short document containing one query term which is very common (low idf)
>
> Many combinations...

Now that I am aware of all these combinations, it does not seem
wise to proceed blindly by punishing whatever we want. Thank you very
much for letting me know.

I have looked at the scoring by enabling debugging. There is a
clear explanation of the score as a product of x, y, z, ...,
where I can find tf, idf, and fieldNorm(), but not coord(q,d) and
queryNorm(q). It is mentioned that queryNorm is the same for all
documents, so no worries there, but what about coord(q,d)?
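
For reference, my understanding (please correct me if it is wrong) is that the default coord is simply overlap / maxOverlap. Below is a plain-Java sketch of that, together with a squared variant that would give more preference to the number of found terms. It does not use Lucene: the class CoordSketch and the method names are hypothetical, and the squared variant is only my own idea, not something taken from the library.

```java
// Illustration only: mirrors the coord(overlap, maxOverlap) idea outside of Lucene.
// CoordSketch, defaultCoord and squaredCoord are hypothetical names.
final class CoordSketch {

    // Fraction of query terms found in the document.
    static float defaultCoord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Squaring the fraction widens the gap between documents that match
    // all query terms and documents that match only some of them.
    static float squaredCoord(int overlap, int maxOverlap) {
        float fraction = overlap / (float) maxOverlap;
        return fraction * fraction;
    }

    public static void main(String[] args) {
        // Two-term query: document matching one term vs. both terms.
        System.out.println("1 of 2 terms: default=" + defaultCoord(1, 2)
                + " squared=" + squaredCoord(1, 2));
        System.out.println("2 of 2 terms: default=" + defaultCoord(2, 2)
                + " squared=" + squaredCoord(2, 2));
    }
}
```

If this is right, a document matching one of two terms would be scaled by 0.5 with the default and by 0.25 with the squared variant, while a full match keeps a factor of 1 in both cases.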

Thanks.

On 16 February 2010 18:00, Ahmet Arslan <iori...@yahoo.com> wrote:
>
>> Hello,
>> Thanks. That clears my doubts. Coming to point two, can
>> you please tell me which part of the Similarity takes care
>> of the same? Is it possible to implement it in such a way that we
>> give more preference to the "number of found terms"?
>
> The public float coord(int overlap, int maxOverlap) method takes care of that:
>
> "coord(q,d) is a score factor based on how many of the query terms are found 
> in the specified document. Typically, a document that contains more of the 
> query's terms will receive a higher score than another document with fewer 
> query terms. This is a search time factor computed in coord(q,d) by the 
> Similarity in effect at search time."
>
>> Also, in our case we need to give more importance to
>> "length normalisation" than the default?
>
> Do you want to punish *more* long documents?
> For example, you can directly return 1/numTerms or 1/(numTerms*numTerms) in
> this method of DefaultSimilarity:
>
> /** Implemented as <code>1/sqrt(numTerms)</code>. */
>  @Override
>  public float lengthNorm(String fieldName, int numTerms) {
>    return (float)(1.0 / Math.sqrt(numTerms));
>  }
>
> There will be a trade-off since there are lots of parameters.
> If you have a two-word query, which one is more important for you:
> A short document containing one word?
> A long document containing two words?
> Or
> A long document containing one query term which is very rare (high idf)
> A short document containing one query term which is very common (low idf)
>
> Many combinations...
>
>
>
>
