Andrzej Bialecki wrote:
> On 2010-10-25 11:22, Toke Eskildsen wrote:
>> On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote:
>>> But it shows a problem of distributed search without common idf.
>>> A doc will get different score in different shard.
>> Bingo.
>>
>> I really don't understand why this fundamental problem with sharding
>> isn't mentioned more often. Every time the advice "use sharding" is
>> given, it should be followed with a "but be aware that it will make
>> relevance ranking unreliable".
>
> The reason is twofold, I think:

And a third potential reason - it's arguably a feature instead of a bug
for some applications. Depending on how I organize my shards, "give me
the most relevant document from each shard for this search" seems like
it could be useful.

> * there is an exact solution to this problem, namely to make two
> distributed calls instead of one (first call to collect per-shard IDFs
> for given query terms, second call to submit a query rewritten with the
> global IDF-s). This solution is implemented in SOLR-1632, with some
> caching to reduce the cost for common queries. However, this means that
> now for every query you need to make two calls instead of one, which
> potentially doubles the time to return results (for simple common
> queries - for rare complex queries the time will be still dominated by
> the query runtime on shard servers).
>
> * another reason is that in many many cases the difference between using
> exact global IDF and per-shard IDFs is not that significant. If shards
> are more or less homogenous (e.g. you assign documents to shards by
> hash(docId)) then term distributions will be also similar. So then the
> question is whether you can accept an N% variance in scores across
> shards, or whether you want to bear the cost of an additional
> distributed RPC for every query...
>
> To summarize, I would qualify your statement with: "...if the
> composition of your shards is drastically different". Otherwise the cost
> of using global IDF is not worth it, IMHO.
>
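
To make the two-call idea quoted above concrete, here is a rough Python sketch. It is not the SOLR-1632 code. The names shard_stats, local_idf and global_idf and the toy shards are hypothetical, and the formula idf = 1 + ln(N / (df + 1)) is a classic Lucene-style idf assumed here only for illustration. The first "call" collects per-shard document counts and document frequencies; the merged numbers are what the second call would send back with the rewritten query.

```python
import math
from collections import Counter

def shard_stats(shard, terms):
    """First call (hypothetical): a shard reports its doc count and per-term doc frequency."""
    df = Counter()
    for doc in shard:
        present = set(doc)
        for t in terms:
            if t in present:
                df[t] += 1
    return len(shard), df

def idf(num_docs, doc_freq):
    # Assumed formula: classic Lucene-style idf, 1 + ln(N / (df + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def local_idf(shard, terms):
    """What a shard computes on its own when no global statistics are shared."""
    n, df = shard_stats(shard, terms)
    return {t: idf(n, df[t]) for t in terms}

def global_idf(shards, terms):
    """Merge the per-shard statistics; the result would accompany the second call."""
    total_docs = 0
    total_df = Counter()
    for shard in shards:
        n, df = shard_stats(shard, terms)
        total_docs += n
        total_df.update(df)  # adds the counts term by term
    return {t: idf(total_docs, total_df[t]) for t in terms}

# Toy data: two deliberately skewed shards, each document a list of tokens.
shards = [
    [["solr", "shard"], ["solr", "idf"], ["solr", "score"]],
    [["lucene", "idf"], ["lucene", "shard"], ["solr", "lucene"]],
]
terms = ["solr"]

per_shard = [local_idf(s, terms)["solr"] for s in shards]
merged = global_idf(shards, terms)["solr"]
print(per_shard, merged)
```

With shards this skewed, the same term gets a noticeably different idf on each shard, and the merged value falls between the two. If documents were spread by hash(docId) instead, the per-shard values would be expected to sit much closer to the merged one, which is the trade-off described in the second bullet.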