Andrzej Bialecki wrote:
> On 2010-10-25 11:22, Toke Eskildsen wrote:
>> On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote: 
>>> But it shows a problem of distributed search without a common idf.
>>> A doc will get a different score in different shards.
>> Bingo.
>>
>> I really don't understand why this fundamental problem with sharding
>> isn't mentioned more often. Every time the advice "use sharding" is
>> given, it should be followed with a "but be aware that it will make
>> relevance ranking unreliable".
> 
> The reason is twofold, I think:


And a third potential reason - it's arguably a feature instead of a bug
for some applications.  Depending on how I organize my shards, "give me
the most relevant document from each shard for this search" seems like
it could be useful.

> * there is an exact solution to this problem, namely to make two
> distributed calls instead of one (first call to collect per-shard IDFs
> for given query terms, second call to submit a query rewritten with the
> global IDFs). This solution is implemented in SOLR-1632, with some
> caching to reduce the cost for common queries. However, this means that
> now for every query you need to make two calls instead of one, which
> potentially doubles the time to return results (for simple common
> queries - for rare complex queries the time will still be dominated by
> the query runtime on shard servers).
> 
> * another reason is that in many many cases the difference between using
> exact global IDF and per-shard IDFs is not that significant. If shards
> are more or less homogeneous (e.g. you assign documents to shards by
> hash(docId)) then term distributions will also be similar. So then the
> question is whether you can accept an N% variance in scores across
> shards, or whether you want to bear the cost of an additional
> distributed RPC for every query...
> 
> To summarize, I would qualify your statement with: "...if the
> composition of your shards is drastically different". Otherwise the cost
> of using global IDF is not worth it, IMHO.
> 
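To make the trade-off above concrete, here is a rough sketch of the two-call
idea in Python. It is not the SOLR-1632 code: the function names, the toy
documents, and the deliberately skewed shards are all made up for
illustration, and the formula assumed is the classic Lucene-style
idf = 1 + ln(numDocs / (docFreq + 1)). The first call collects document
counts and document frequencies per shard; the merged totals give the
global IDF that the second call would send back with the rewritten query.

```python
import math
from collections import Counter


def shard_stats(docs, terms):
    """First call, run on each shard: doc count and per-term doc frequency."""
    df = Counter()
    for doc in docs:
        tokens = set(doc.split())
        for term in terms:
            if term in tokens:
                df[term] += 1
    return len(docs), df


def idf(num_docs, doc_freq):
    # Assumed formula (classic Lucene-style IDF), used only for illustration.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def global_idf(stats, terms):
    """Merge per-shard stats into the IDF the second call would use."""
    total_docs = sum(n for n, _ in stats)
    total_df = Counter()
    for _, df in stats:
        total_df.update(df)  # Counter.update adds counts together
    return {term: idf(total_docs, total_df[term]) for term in terms}


# Hypothetical, intentionally skewed shards: "solr" is common in A, rare in B.
shard_a = ["solr shard", "solr idf", "solr query", "lucene index"]
shard_b = ["lucene index", "lucene score", "solr idf", "cache hit"]
terms = ["solr"]

stats = [shard_stats(shard, terms) for shard in (shard_a, shard_b)]
local = [idf(n, df["solr"]) for n, df in stats]
merged = global_idf(stats, terms)

print("per-shard idf:", local)
print("global idf:   ", merged["solr"])
```

With shards this lopsided the per-shard values sit well apart on either
side of the global one, which is the "drastically different composition"
case. If documents were spread by hash(docId), the per-shard document
frequencies would be expected to be close to each other and the gap would
shrink accordingly.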
