On 3/19/2012 11:55 PM, Ankita Patil wrote:
Hi,

I wanted to know whether it is feasible to query on all the shards even if
the query yields data only from a few shards and not all. Or is it better to
mention those shards explicitly from which we get the data and only query
on them.

for example :
I have 4 shards. Now I have a query which yields data only from 2 shards.
So should I select those 2 shards only and query on them or is it ok to
query on all the shards? Will that affect the performance in any way?

I use a sharded index, but I am not a seasoned Java/Solr/Lucene developer. My clients do not use the shards parameter themselves - they talk to a load balancer, which in turn talks to a special core that has the shards in its request handler config and has no index of its own. I call it a broker, because that is what our previous search product (EasyAsk) called it.

As I understand things, the performance of your slowest shard, whether that is because of index size on that shard or the underlying hardware, will be a large factor in the performance of the entire index. A distributed query sends an identical query to all the shards it is configured for. It sends those requests in parallel, gathers the results, and builds a final result to send to the client, so the combined response cannot come back before the slowest shard has answered.
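To make that concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: the shard names and latency numbers are made up for illustration, and it assumes the shard requests run fully in parallel with a fixed merge cost on top, which is a simplification of what Solr actually does.

```python
def distributed_latency(shard_latencies_ms, merge_overhead_ms=0.0):
    """Estimate the response time of a distributed query.

    Assumes every shard is queried in parallel, so the broker waits
    for the slowest shard and then spends merge_overhead_ms combining
    the results. Both inputs are hypothetical numbers, not measurements.
    """
    return max(shard_latencies_ms.values()) + merge_overhead_ms


# Hypothetical per-shard response times in milliseconds.
all_shards = {"shard1": 20, "shard2": 35, "shard3": 400, "shard4": 25}

# Querying everything: the slow shard3 dominates.
total_all = distributed_latency(all_shards)

# Querying only the two shards that actually hold matching data.
only_needed = {name: all_shards[name] for name in ("shard1", "shard2")}
total_needed = distributed_latency(only_needed)

print(total_all, total_needed)
```

In this made-up case, leaving out the slow shard helps a lot. If all four shards answered in 20 to 35 ms, dropping two of them would change almost nothing.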

You MIGHT get better performance by not including the other shards. If the shards that have no matching documents answer super-fast, leaving them in probably won't make any real difference. If it takes them a long time to report that there are no results, then removing them would make things go faster. That requires intelligence on the client to know where the data is. If the client does not know where the data is, it is safer to simply include all the shards.
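If the client does know where the data is, it can name just those shards in the shards parameter, which takes a comma-separated list of shard addresses. The Python sketch below only builds the request URL. The host names, core names and the build_query_url helper are hypothetical, and it assumes the request handler you are hitting accepts a shards value from the client rather than forcing its own.

```python
from urllib.parse import urlencode


def build_query_url(base_url, query, shards):
    """Build a Solr select URL restricted to the given shards.

    base_url and the shard addresses are placeholders; the shards
    parameter itself is a comma-separated list of host:port/path
    entries.
    """
    params = {"q": query, "shards": ",".join(shards)}
    # Keep ':', '/' and ',' unescaped so the URL stays readable.
    return base_url + "/select?" + urlencode(params, safe=":/,")


# Hypothetical addresses for the two shards that hold the data.
needed_shards = [
    "shard1.example.com:8983/solr/core",
    "shard2.example.com:8983/solr/core",
]

url = build_query_url(
    "http://lb.example.com:8983/solr/broker", "text:solr", needed_shards
)
print(url)
```

Passing all four addresses instead gives the "query everything" behaviour, which is the safe default when the client cannot tell where the data lives.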

Thanks,
Shawn
