The very first thing I would do is straighten out your commit strategy; your settings are _very_ aggressive. I'd guess you're also seeing warnings in the logs about "too many on deck searchers" or something like that, or you've upped your max warming searchers in solrconfig.xml.
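For reference, the max warming searchers limit lives in the <query> section of solrconfig.xml. A minimal sketch, assuming the value of 2 that ships in the example configs (your file may differ); raising it tends to hide overlapping commits rather than fix them:

```xml
<query>
  <!-- Maximum number of searchers that may be warming concurrently.
       2 is assumed here as the stock example value; check your own file. -->
  <maxWarmingSearchers>2</maxWarmingSearchers>
</query>
```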
Soft commits aren't free. They're less expensive than hard commits (openSearcher=true), but they're not free. Here's a long writeup on this:
https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

What I'd do:

1> remove maxDocs entirely
2> set openSearcher=false for your autoCommit
3> remove maxDocs from your autoSoftCommit
4> lengthen the soft commit interval as much as you can stand
5> if you must have very short soft commits, consider turning off (or at least down) your caches in solrconfig.xml
6> stop issuing any kind of commits from the client. This is an anti-pattern except in very unusual circumstances, and in your setup you see all the docs 2 seconds later anyway, so it is doing you no good and (maybe) active harm.

If the problem persists, try looking at your garbage collection; you may well be hitting long GC pauses.

Also note that there was a bottleneck in Solr prior to 5.2 when replicas were present, see:
http://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/

Best,
Erick

On Sun, Jun 21, 2015 at 7:14 AM, danny teichthal <dannyt...@gmail.com> wrote:

> Hi,
>
> We are experiencing some intermittent slowness on updates for one of our collections.
>
> We see user operations hanging on updates to SOLR via SolrJ client.
>
> Every time, during the period of slowness, we see something like this in the log of the replica:
>
> [org.apache.solr.update.UpdateHandler] Reordered DBQs detected. Update=add{_version_=1504391336428568576,id=2392581250002321} DBQs=[DBQ{version=1504391337298886656,q=level_2_id:12345}]
>
> After a while the DBQs pile up and we see the list of DBQs growing.
>
> At some point the update time increases from 300 ms to 20 seconds, and then on the leader log I see a read timeout exception and it initiates recovery on the replica.
>
> At that point all updates start to be very slow, from 20 seconds to 60 seconds. Especially updates with deleteByQuery.
>
> We are not sure if the DBQ is the cause or a symptom. But what does not make sense to me is that the slowness is only on the replica side.
>
> We suspect that the updates becoming slow on the replica causes a timeout on the leader side, which in turn causes the recovery.
>
> Would really appreciate any help on this.
>
> Thanks,
>
> Some info:
>
> DBQs are sent as separate update requests from the add requests.
>
> We currently use SolrCloud 4.9.0.
>
> We have ~140 collections on 4 nodes: 1, 2, 3, 4.
>
> Each collection has a single shard with a leader and another replica.
>
> ~70 collections are on nodes 1 and 2 as leader and replica, and the other collections are on 3 and 4.
>
> On each node there’s about 65GB of index with 25,000,000 documents.
>
> This is our update handler. autoSoftCommit is set to 2 seconds, but there may be manual soft commits coming from user operations from time to time:
>
> <updateHandler class="solr.DirectUpdateHandler2">
>   <autoCommit>
>     <maxDocs>10000</maxDocs>
>     <maxTime>120000</maxTime>
>     <openSearcher>true</openSearcher>
>   </autoCommit>
>   <autoSoftCommit>
>     <maxDocs>1000</maxDocs>
>     <maxTime>2000</maxTime>
>   </autoSoftCommit>
>   <updateLog />
> </updateHandler>
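
Putting steps 1> through 4> of the reply together, the update handler quoted above might end up looking roughly like this. This is a sketch only: the 120000 ms hard commit interval is simply carried over from the original config, and the 60000 ms soft commit interval is an assumed placeholder for "as much as you can stand", not a recommended value.

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- maxDocs removed; hard commits are triggered by time only -->
    <maxTime>120000</maxTime>
    <!-- hard commits flush to disk but do not open a new searcher -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <!-- maxDocs removed; 60000 ms is an assumed placeholder,
         lengthen or shorten to what the application can tolerate -->
    <maxTime>60000</maxTime>
  </autoSoftCommit>
  <updateLog />
</updateHandler>
```

With openSearcher=false on the hard commit, visibility of new documents is governed by the soft commit interval alone, so that interval is the one to tune against how stale search results are allowed to be.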