On 1/8/2015 7:26 AM, Andrew Butkus wrote: > Hi, we have 8 solr servers, split 4x4 across 2 data centers. > > We have a collection of around ½ billion documents, split over 100 shards, > each is replicated 4 times on separate nodes (evenly distributed across both > data centers). > > The problem we have is that when we use cursormark (and also when we don't > use cursormark the pattern below is the same but just shorter in time) the > time it takes to query each shard gets progressively longer when distrib=true > , I have tried to query shards directly (with shards=) and select my own > shards to query to see if it was a bandwidth bottleneck and the performance > is normal / fine - when using pre-defined shards. > > Does anyone know why the shards become progressively slower when > distrib=true? Or any suggestions on how I can fix, or how to debug the > problem further? > > I have monitored the performance of CPU and it never goes above 10% on each > server, so its not cpu, also the memory usage is about 4gb out of 16gb so its > not a memory issue either. > > I have tried all shard shuffling strategies incase it was a bottleneck at a > server being over used but as above, the cpu never goes above 10%, and when I > use shards= there are never any querytime bottlenecks.
The part about memory usage is not clear. That 4GB and 16GB could refer to the operating system view of memory, or the view of memory within the JVM. I'm curious about how much total RAM each machine has, how large the Java heap is, and what the total size of the indexes that live on each machine is. Even if they are individually very small, 500 million documents will result in a very large index, so I'm guessing that you don't have enough RAM on each server for your index size. What can happen with a highly sharded index that is too large for available RAM: Index data for the initial queries gets read from the OS disk cache, but as those queries run, the information required for the shards that come later in the distributed query gets pushed out of the disk cache, so Solr must actually read the disk to do those later queries. Disks are slow, so if the machine has to actually read from the disk, Solr will be slow. http://wiki.apache.org/solr/SolrPerformanceProblems#RAM Thanks, Shawn