Hi Yonik,

I see; I didn't realize there was a second phase to retrieve stored values. Sphinx also queries the top n documents and combines the results; unless the algorithm is very different, I wouldn't expect this to add much overhead, as Sphinx sees a very definite performance boost when distributing.
It seems to me that all of the points you're making below apply just as well to distributing across multiple boxes; where does the issue of doing the distribution on a single box come into play? Anecdotally, everything you're saying meshes completely with my load testing of Solr (the single full index performs better than the distributed index). I may have to stick with Sphinx, though, if I can't boost the performance of Solr on a single box.

-Harish


yonik wrote:
> 
> On Thu, Jan 8, 2009 at 10:03 PM, smock <harish.agar...@gmail.com> wrote:
>> I don't mean to be argumentative - just trying to understand, what is the
>> difference between distributed search across processors, and distributed
>> search across boxes (again, assuming that my searches are truly CPU
>> bound)?
> 
> Even if your searches are CPU bound, there is CPU and IO overhead in
> distributed search.
> 
> time_for_whole_index
> vs
> time_for_half_index + distributed_search_overhead
> 
> Distributed search is optimized for the case when the index is so big
> that one *must* distribute it across multiple shards. It works in
> multiple phases, first only collecting and merging the document ids,
> and then requesting stored fields for the top documents in another
> phase. It's also optimized for total throughput of the whole system.
> 
> If one was optimizing for response time with smaller documents and
> single requests, then merging results in a single shot would yield
> better results.
> 
> If you load test a distributed vs non-distributed system on a single
> box, the distributed will normally lose. This is because to find the
> top 10 documents in general, one must retrieve the top 10 documents
> from each shard - more work is done. Single request latency *can* be
> shorter under the right circumstances, but under load it will always
> lose since more work is done in aggregate.
> 
> -Yonik
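
P.S. To check my understanding of the two-phase scheme you're describing, here is a rough sketch of how I picture the merge working. This is just an illustration, not Solr's actual code; ShardClient, ScoredId, and TwoPhaseSearcher are made-up names:

// Rough sketch only -- not Solr's implementation. ShardClient, ScoredId,
// and TwoPhaseSearcher are invented names for illustration.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

interface ShardClient {
    // Phase 1: a shard returns only its own top n (docId, score) pairs.
    List<ScoredId> topIds(String query, int n);

    // Phase 2: stored fields are fetched only for ids that survived the merge.
    List<String> fetchStoredFields(List<String> docIds);
}

class ScoredId {
    final ShardClient shard;   // shard that produced the hit
    final String docId;        // unique key only -- no stored fields yet
    final float score;

    ScoredId(ShardClient shard, String docId, float score) {
        this.shard = shard;
        this.docId = docId;
        this.score = score;
    }
}

class TwoPhaseSearcher {
    List<String> search(List<ShardClient> shards, String query, int n) {
        // Phase 1: every shard must score and return its top n, so the
        // aggregate work grows with the shard count even though only n
        // documents are ultimately kept.
        PriorityQueue<ScoredId> topN =
                new PriorityQueue<>(Comparator.comparingDouble((ScoredId s) -> s.score));
        for (ShardClient shard : shards) {
            for (ScoredId hit : shard.topIds(query, n)) {
                topN.offer(hit);
                if (topN.size() > n) {
                    topN.poll();   // evict the current lowest-scoring hit
                }
            }
        }

        // Phase 2: second round trip, asking each owning shard for the
        // stored fields of only the winning documents.
        Map<ShardClient, List<String>> idsByShard = new HashMap<>();
        for (ScoredId hit : topN) {
            idsByShard.computeIfAbsent(hit.shard, s -> new ArrayList<>()).add(hit.docId);
        }
        List<String> docs = new ArrayList<>();
        for (Map.Entry<ShardClient, List<String>> entry : idsByShard.entrySet()) {
            docs.addAll(entry.getKey().fetchStoredFields(entry.getValue()));
        }
        return docs;   // final score-ordering of the merged page omitted here
    }
}

If that shape is roughly right, the extra aggregate work is clear: every shard scores and returns its own top n even though only n documents survive the merge, and there is a second round trip for the stored fields of the winners.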