Robert, Tom,

That's it indeed! Using maxDoc as the numerator instead of docCount yields very skewed results for an unevenly distributed multi-lingual index. We have one language dominating the other twenty, so the dominating language appears to contain no rare terms compared to the others.
We're now checking results using docCount and it seems alright. I do have to get used to the fact that document scores are now roughly 1000 times higher than before, but I'm already very happy with CollectionStatistics and will see if all works well.

Any other tips to share?

Thanks,
Markus

-----Original message-----
> From:Robert Muir <rcm...@gmail.com>
> Sent: Thu 08-Nov-2012 17:44
> To: solr-user@lucene.apache.org
> Subject: Re: Skewed IDF in multi lingual index
>
> Hi Markus: how are the languages distributed across documents?
>
> Imagine I have a text_en field and a text_fr field. Let's say I have
> 100 documents, 95 are English and only 5 are French.
> So the text_en field is populated 95% of the time, and the text_fr 5%
> of the time.
>
> But the default IDF computation doesn't look at things this way: it
> always uses '100' as maxDoc. So in such a situation, any terms against
> text_fr are "rare" :)
>
> The first thing I would look at is treating this situation as merging
> results from an English index with 95 docs and a French index with 5
> docs.
> So I would consider overriding the two idfExplain methods (term and
> phrase) to use CollectionStatistics.docCount() instead of
> CollectionStatistics.maxDoc()
> The former would be 95 for the English field (instead of 100), and 5
> for the French field (instead of 100).
>
> I don't think this will solve all your problems, but it might help.
>
> Note: you must ensure your index is fully upgraded to 4.0 to try this
> statistic, otherwise it will return -1 if you have any 3.x segments in
> your index.
>
> On Thu, Nov 8, 2012 at 11:13 AM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
> > Hi,
> >
> > We're testing a large multi lingual index with _LANG fields for each
> > language and using dismax to query them all. Users provide, explicit or
> > implicit, language preferences that we use for either additive or
> > multiplicative boosting on the language of the document.
> > However, additive boosting is not adequate because it cannot overcome
> > the extremely high IDF values for the same word in another language, so
> > regardless of the preference, foreign documents are returned.
> > Multiplicative boosting solves this problem but has another downside: it
> > doesn't allow us, with standard qf=field^boost, to prefer documents in
> > another language above the preferred language, because the
> > multiplicative boost is so strong. We do use the def function
> > (boost=def(query($qq),.3)) to prevent one boost query from returning 0
> > and thus a product of 0 for all boost queries. But it doesn't help that
> > much.
> >
> > This all comes down to IDF differences between the languages; even
> > common words such as country names like `india` show large differences
> > in IDF. Is there anyone with some hints or experiences to share about
> > skewed IDF in such an index?
> >
> > Thanks,
> > Markus
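
For illustration, the 95/5 example from Robert's message can be worked through numerically. The sketch below is standalone Java arithmetic, not Lucene code. It assumes the classic TF-IDF formula idf = 1 + ln(numDocs / (docFreq + 1)) as used by DefaultSimilarity. The class name, the method name, and the document frequencies (94 and 4) are hypothetical, chosen so that each term is equally common within its own language.

```java
// Standalone arithmetic sketch; no Lucene classes are used.
final class IdfSkewDemo {

    // Assumed formula (classic Lucene TF-IDF): 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        long maxDoc = 100;       // all documents in the index
        long docCountEn = 95;    // documents that have a text_en value
        long docCountFr = 5;     // documents that have a text_fr value

        // Hypothetical document frequencies: each term occurs in all but
        // one of the documents of its own language.
        long dfEn = 94;
        long dfFr = 4;

        // Numerator = maxDoc: the French term looks far rarer than the English one.
        System.out.printf("maxDoc   en=%.3f fr=%.3f%n",
                idf(dfEn, maxDoc), idf(dfFr, maxDoc));

        // Numerator = per-field docCount: both terms get the same IDF.
        System.out.printf("docCount en=%.3f fr=%.3f%n",
                idf(dfEn, docCountEn), idf(dfFr, docCountFr));
    }
}
```

With maxDoc as the numerator, the French term scores an IDF of about 4.0 against about 1.05 for the English term, although both are equally common within their own language. With the per-field docCount, both come out at 1.0, which is the effect described at the top of this thread.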