On 8/1/2014 3:17 PM, Ethan wrote:
> Our SolrCloud setup: 3 nodes with Zookeeper, 2 running SolrCloud.
>
> Current dataset size is 97GB. The JVM heap is 10GB, but only 6GB is
> used (to keep garbage collection times down). RAM is 96GB.
>
> Our softcommit is set to 2 secs and hardcommit is set to 1 hour.
>
> We are suddenly seeing high disk and network IOs. During a search the
> leader usually logs one more query with its node name and shard
> information -
>
> "{NOW=1406911121656&shard.url=
> chexjvassoms006.ch.expeso.com:52158/solr/Main......
> ids=-9223372036371158536,-9223372036373602680,-9223372036618637568,-9223372036371157736......&distrib=false&timeAllowed=2000&wt=javabin&isShard=true"
>
> The actual query didn't have any of this information. This started
> just today and is causing a lot of latency issues. We have had nodes
> go down several times today.
That query is from distributed search -- it's the query that actually retrieves the documents from the shards after the results of the initial query have been tabulated to determine which documents are needed. The "ids" parameter is what tells me this.

Do you know how long those autoSoftCommit operations take? If you are indexing frequently enough and the commits are taking longer than the configured interval of two seconds, you may have multiple commits happening at the same time. Soft commits are faster and use fewer resources than hard commits, but they aren't even close to free -- they're still going to hit the disk and memory very hard.

One thing to note: an hour may be too long for the hard commit interval. A hard commit starts a new transaction log, so on restart, Solr will replay all of the updates that occurred in the last hour. If your update rate is low, that might be acceptable, but if the update rate is high, that could be a LOT of updates, making Solr restarts *very* slow.

Thanks,
Shawn
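P.S. For reference, the commit intervals live in the updateHandler section of solrconfig.xml. The numbers below are only illustrative -- a much shorter hard commit with openSearcher=false and a less aggressive soft commit -- and are not a recommendation for your exact indexing rate:

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- Hard commit: flushes to disk and rolls over the transaction
         log; openSearcher=false keeps it from opening a new searcher,
         so it stays relatively cheap even when run often. -->
    <autoCommit>
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <!-- Soft commit: controls how soon newly indexed documents become
         visible to searches. -->
    <autoSoftCommit>
      <maxTime>30000</maxTime>
    </autoSoftCommit>
  </updateHandler>

With openSearcher=false, frequent hard commits keep the transaction log small (so restart replay stays short) without paying the cost of opening a new searcher; document visibility is then governed entirely by the soft commit interval.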