Re: Merging results from Shards - relevancy and performance

Erick Erickson Sun, 01 Jan 2012 08:06:44 -0800

1> Yes. Note that distributed tf/idf is an issue: if your documents are
     statistically very different across shards, the scores aren't
     really comparable. This is changing, but I don't think it's
     committed yet.
2> Well, you're mixing apples and oranges I think. The general
     recommendation is to use a single core and *replicate* it
     across as many machines as necessary until your index gets
     too big to fit on your machines (i.e. you cannot get decent
     query times at all). This is NOT distributed searching as
     each request is wholly serviced by a single slave searcher.

     Once you cross the threshold of what fits on your hardware,
     you really have no choice except to shard and use distributed
     searching.

     There is certainly some overhead, but since you have no
     choice but to pay it, you just cope <G>.

     At very large scale (i.e. lots of shards on lots of machines),
     you run into the "laggard problem". That is, as the number
     of shards increases, so does the chance that at least one
     of them will, for whatever reason, take an anomalously long
     time to complete, which delays your final results because the
     merged response has to wait for the slowest shard.
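
To put a rough number on the laggard problem, here is a minimal sketch. It assumes each shard is independently slow with the same probability, which is a simplification, and the 1% figure and the function name are made up for illustration.

```python
def prob_at_least_one_slow(num_shards: int, p_slow: float) -> float:
    """Chance that at least one of num_shards shards is slow.

    Assumes shards are slow independently with the same probability
    p_slow (a hypothetical figure, not a measured one).
    """
    return 1.0 - (1.0 - p_slow) ** num_shards


if __name__ == "__main__":
    # 0.01 is an assumed per-shard chance of an anomalously slow response.
    for n in (1, 10, 100, 500):
        print(f"{n:>4} shards: {prob_at_least_one_slow(n, 0.01):.1%}")
```

Under those assumptions a single shard is slow 1% of the time, but with 100 shards about 63% of requests wait on at least one slow shard.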

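On point 1>, a small sketch of why per-shard scores may not be comparable. It assumes the classic Lucene-style idf, 1 + ln(numDocs / (docFreq + 1)), and the shard names and document counts are invented for illustration. This is not what Solr executes, just the arithmetic behind the concern.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Classic-style inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical shards: (documents in shard, documents containing the term).
shards = {
    "shard_a": (1_000_000, 500_000),  # term is very common here
    "shard_b": (1_000_000, 50),       # term is rare here
}

# idf as each shard would compute it from its own local statistics.
local_idf = {name: idf(n, df) for name, (n, df) in shards.items()}

# idf computed from statistics summed over the whole collection.
total_docs = sum(n for n, _ in shards.values())
total_df = sum(df for _, df in shards.values())
global_idf = idf(total_docs, total_df)

if __name__ == "__main__":
    for name, value in local_idf.items():
        print(f"{name}: local idf = {value:.2f}")
    print(f"global idf = {global_idf:.2f}")
```

With these made-up numbers the same term gets a much higher weight on shard_b than on shard_a, so documents from shard_b would tend to outrank equally good matches from shard_a once the results are merged.
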
FWIW
Erick

On Sun, Jan 1, 2012 at 4:34 AM, shlomi java <shlomij...@gmail.com> wrote:
> hola,
>
> 1) When distributing search across several Shards, is the merged result
> reflects the overall ranking, cross-shards?
> I'm talking about stuff like "document frequency".
> I guess it does, otherwise distributed search wouldn't have overhead.
>
> talking about overhead,
> 2) is there a known ratio of the overhead of using shards against
> single core, and the impact on performance for adding the N+1 shard to the
> distributed index?
>
> thanks for any knowledge/thought.
> ShlomiJ
