1> Yes. Note that the distributed tf/idf is an issue, although it's changing.
That is, if your documents are statistically very different across
shards, the scores aren't really comparable. This is changing, but
I don't think it's committed yet.
2> Well, you're mixing apples and oranges I think. The general
recommendation is to use a single core and *replicate* it
across as many machines as necessary until you index gets
too big to fit on your machines (i.e. you cannot get decent
query times at all). This is NOT distributed searching as
each request is wholly serviced by a single slave searcher.
Once you cross the threshold of what fits on your hardware,
you really have no choice except to shard and use distributed
searching.
There is certainly some overhead, but since you have no
choice but to pay it, you just cope <G>.
At very large scale (i.e. lots of shards on lots of machines),
you run into the "laggard problem". That is, as the number
of shards increases, so does the chance that at least one
of them will, for whatever reason, take an anomalously long
time to complete which will slow your final results.
FWIW
Erick
On Sun, Jan 1, 2012 at 4:34 AM, shlomi java <[email protected]> wrote:
> hola,
>
> 1) When distributing search across several Shards, is the merged result
> reflects the overall ranking, cross-shards?
> I'm talking about stuff like "document frequency".
> I guess it does, otherwise distributed search wouldn't have overhead.
>
> talking about overhead,
> 2) is there a known ratio of the overhead of using shards against
> single core, and the impact on performance for adding the N+1 shard to the
> distributed index?
>
> thanks for any knowledge/thought.
> ShlomiJ