1> Yes. Note that the distributed tf/idf is an issue, although it's changing. That is, if your documents are statistically very different across shards, the scores aren't really comparable. This is changing, but I don't think it's committed yet. 2> Well, you're mixing apples and oranges I think. The general recommendation is to use a single core and *replicate* it across as many machines as necessary until you index gets too big to fit on your machines (i.e. you cannot get decent query times at all). This is NOT distributed searching as each request is wholly serviced by a single slave searcher.
Once you cross the threshold of what fits on your hardware, you really have no choice except to shard and use distributed searching. There is certainly some overhead, but since you have no choice but to pay it, you just cope <G>. At very large scale (i.e. lots of shards on lots of machines), you run into the "laggard problem". That is, as the number of shards increases, so does the chance that at least one of them will, for whatever reason, take an anomalously long time to complete which will slow your final results. FWIW Erick On Sun, Jan 1, 2012 at 4:34 AM, shlomi java <shlomij...@gmail.com> wrote: > hola, > > 1) When distributing search across several Shards, is the merged result > reflects the overall ranking, cross-shards? > I'm talking about stuff like "document frequency". > I guess it does, otherwise distributed search wouldn't have overhead. > > talking about overhead, > 2) is there a known ratio of the overhead of using shards against > single core, and the impact on performance for adding the N+1 shard to the > distributed index? > > thanks for any knowledge/thought. > ShlomiJ