RE: Skewed IDF in multi lingual index

Markus Jelsma Mon, 12 Nov 2012 04:34:37 -0800

I'd like to add that multiplicative boosting on very scarce properties, e.g. 
a boolean value that only very few documents have, causes a problem in scoring 
when using docCount instead of maxDoc. If docCount is one, IDF will be ~0.3, 
and with the fieldWeight you'll end up with a score below 1. Because of this 
the product of all multiplicative boosts will be lower than the product of 
similar boosts, lowering the document in rank instead of boosting it.
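
To make the numbers concrete, here is a minimal sketch of the arithmetic. It 
assumes the classic idf formula, 1 + ln(numDocs / (docFreq + 1)); the index 
size of 100 and the other boost values are hypothetical and only there for 
illustration, and a document without the property is assumed to contribute a 
neutral factor of 1.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic TF-IDF idf: 1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Exactly one document carries the scarce boolean property.
doc_freq = 1
max_doc = 100   # hypothetical total number of documents in the index
doc_count = 1   # documents that have any value at all in that field

idf_with_max_doc = classic_idf(doc_freq, max_doc)      # well above 1
idf_with_doc_count = classic_idf(doc_freq, doc_count)  # 1 + ln(1/2), below 1

# Hypothetical multiplicative boosts shared by two otherwise similar documents.
other_boosts = [1.5, 2.0]

# The flagged document gets one extra factor; the other one is assumed to
# get a neutral factor of 1 (i.e. no extra factor at all).
product_with_flag = math.prod(other_boosts + [idf_with_doc_count])
product_without_flag = math.prod(other_boosts)

print(f"idf using maxDoc:   {idf_with_max_doc:.3f}")
print(f"idf using docCount: {idf_with_doc_count:.3f}")
print(f"boost product with flag:    {product_with_flag:.3f}")
print(f"boost product without flag: {product_without_flag:.3f}")
```

Any factor below 1 shrinks the product, so under these assumptions the 
document that matches the scarce property ends up ranked lower than an 
otherwise identical document that does not match it.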


-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Fri 09-Nov-2012 10:23
> To: [email protected]
> Subject: RE: Skewed IDF in multi lingual index
> 
> Robert, Tom,
> 
> That's it indeed! Using maxDoc as the numerator instead of docCount yields very 
> skewed results for an unevenly distributed multi-lingual index. We have one 
> language dominating the other twenty so the dominating language contains no 
> rare terms compared to the others.
> 
> We're now checking results using docCount and it seems alright. I do have to 
> get used to the fact that document scores are now roughly 1000 times higher 
> than before but I'm already very happy with CollectionStatistics and will see 
> if all works well.
> 
> Any other tips to share?
> 
> Thanks,
> Markus
> 
>  
>  
> -----Original message-----
> > From:Robert Muir <[email protected]>
> > Sent: Thu 08-Nov-2012 17:44
> > To: [email protected]
> > Subject: Re: Skewed IDF in multi lingual index
> > 
> > Hi Markus: how are the languages distributed across documents?
> > 
> > Imagine I have a text_en field and a text_fr field. Let's say I have
> > 100 documents, 95 are english and only 5 are french.
> > So the text_en field is populated 95% of the time, and the text_fr 5%
> > of the time.
> > 
> > But the default IDF computation doesn't look at things this way: it
> > always uses '100' as maxDoc. So in such a situation, any terms against
> > text_fr are "rare" :)
> > 
> > The first thing I would look at is treating this situation as merging
> > results from an English index with 95 docs and a French index with 5
> > docs.
> > So I would consider overriding the two idfExplain methods (term and
> > phrase) to use CollectionStatistics.docCount() instead of
> > CollectionStatistics.maxDoc()
> > The former would be 95 for the english field (instead of 100), and 5
> > for the french field (instead of 100).
> > 
> > I don't think this will solve all your problems: but it might help.
> > 
> > Note: you must ensure your index is fully upgraded to 4.0 to try this
> > statistic, otherwise it will return -1 if you have any 3.x segments in
> > your index.
> > 
> > On Thu, Nov 8, 2012 at 11:13 AM, Markus Jelsma
> > <[email protected]> wrote:
> > > Hi,
> > >
> > > We're testing a large multi lingual index with _LANG fields for each 
> > > language and using dismax to query them all. Users provide, explicit or 
> > > implicit, language preferences that we use for either additive or 
> > > multiplicative boosting on the language of the document. However, 
> > > additive boosting is not adequate because it cannot overcome the 
> > > extremely high IDF values for the same word in another language so 
> > > regardless of the preference, foreign documents are returned. 
> > > Multiplicative boosting solves this problem but has the other downside as 
> > > it doesn't allow us with standard qf=field^boost to prefer documents in 
> > > another language above the preferred language because the multiplicative 
> > > is so strong. We do use the def function (boost=def(query($qq),.3)) to 
> > > prevent one boost query from returning 0 and thus a product of 0 for all 
> > > boost queries. But it doesn't help that much.
> > >
> > > This all comes down to IDF differences between the languages, even common 
> > > words such as country names like `india` show large differences in IDF. 
> > > Is there anyone with some hints or experiences to share about skewed IDF 
> > > in such an index?
> > >
> > > Thanks,
> > > Markus
> > 
>
