On Thu, Jan 8, 2009 at 10:03 PM, smock <harish.agar...@gmail.com> wrote:
> I don't mean to be argumentative - just trying to understand, what is the
> difference between distributed search across processors, and distributed
> search across boxes (again, assuming that my searches are truly CPU bound)?

Even if your searches are CPU bound, there is CPU and IO overhead in
distributed search.

time_for_whole_index
  vs
time_for_half_index + distributed_search_overhead

Distributed search is optimized for the case when the index is so big
that one *must* distribute it across multiple shards.  It works in
multiple phases, first only collecting and merging the document ids,
and then requesting stored fields for the top documents in another
phase.  It's also optimized for total throughput of the whole system.

If one was optimizing for response time with smaller documents and
single requests, then merging results in a single shot would yield
better results.

If you load test a distributed vs non-distributed system on a single
box, the distributed will normally lose.  This is because to find the
top 10 documents in general, one must retrieve the top 10 documents
from each shard - more work is done.  Single request latency *can* be
shorter under the right circumstances, but under load it will always
lose since more work is done in aggregate.

-Yonik

Reply via email to