1.5M docs in an hour isn't anywhere near the rates I saw trigger the LIR (leader-initiated recovery) problem, so I strongly doubt that's the issue, never mind ;)
On Thu, Jul 2, 2015 at 1:47 PM, Vincenzo D'Amore <v.dam...@gmail.com> wrote:
> We are trying to send documents as fast as we can; we wrote a multi-threaded
> SolrJ application that reads from file, Solr, or an RDBMS and updates a
> collection.
> But if we use too many threads during the day, the servers become unresponsive.
> Now, at night, with a low number of searches, we reindex the entire
> collection (1.5M docs) with 2 threads in about 1 h.
>
> As I wrote, I'm now using CMS (ConcurrentMarkSweep), and I assumed that
> using Shawn's suggestions about GC was enough to have the right
> configuration.
>
> On Thu, Jul 2, 2015 at 7:05 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> bq: and we do a full update of all documents during the night.
>>
>> How fast are you sending documents? Prior to Solr 5.2 the replicas
>> would do twice the amount of work for indexing that the leader
>> did (odd, but...) See:
>>
>> http://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/
>>
>> Still, focusing on the GC pauses is probably the most fruitful. You just
>> shouldn't be getting pauses that long with 16G heaps. How long does it
>> take you to re-index? I've seen situations where indexing at an
>> _extremely_ high rate will force replicas into recovery. This took 150
>> threads all firing queries as fast as possible to hit, but I thought I'd
>> mention it.
>>
>> Best,
>> Erick
>>
>> On Thu, Jul 2, 2015 at 12:56 PM, Vincenzo D'Amore <v.dam...@gmail.com>
>> wrote:
>> > Hi Erick,
>> >
>> > thanks for your answer.
>> >
>> > We use Java 8 and allocate a 16GB heap size:
>> >
>> > -Xms2g -Xmx16g
>> >
>> > There are 1.5M docs and about 16 GB index size on disk.
>> >
>> > Let me also say, during the day we have a lot of little updates, from 1k
>> > to 50k docs every time, and we do a full update of all documents during
>> > the night. And during this full update the 20-second GC happened.
>> >
>> > I haven't read Uwe's post completely, just because it was too long; all I
>> > got was that I have to use MMapDirectory.
>> > But I was still unable to restart production with this new component.
>> > After the change it is not clear if we only need to restart the core/node
>> > or if a full reindex must be done.
>> >
>> > Thanks for your time, I'll read Uwe's post very carefully.
>> >
>> > On Thu, Jul 2, 2015 at 5:39 PM, Erick Erickson <erickerick...@gmail.com>
>> > wrote:
>> >
>> >> Vincenzo:
>> >>
>> >> First and foremost, figure out why you're having 20-second GC pauses.
>> >> For indexes like you're describing, this is unusual. How big is the heap
>> >> you allocate to the JVM?
>> >>
>> >> Check your Zookeeper timeout. In earlier versions of SolrCloud it
>> >> defaulted to 15 seconds. Going into leader election would happen for no
>> >> obvious reason, and lengthening it to 30-60 seconds seemed to help a lot
>> >> of people.
>> >>
>> >> The disks should be largely irrelevant to the origin or cure for this
>> >> problem...
>> >>
>> >> Here's a good article on why you want to allocate "just enough" heap
>> >> for your app. Of course, "just enough" can be interesting to actually
>> >> define:
>> >>
>> >> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, Jul 2, 2015 at 5:45 AM, Vincenzo D'Amore <v.dam...@gmail.com>
>> >> wrote:
>> >> > Hi All,
>> >> >
>> >> > In the last few months my SolrCloud clusters have sometimes (once or
>> >> > twice a week) had a few replicas down.
>> >> > Usually all the replicas go down on the same node.
>> >> > I'm unable to understand why a 3-node cluster with 8 cores/32 GB and
>> >> > high-performance disks has this problem. The main index is small, about
>> >> > 1.5M documents with very little text inside.
>> >> > I don't know if having 3 shards with 3 replicas is too much; to me it
>> >> > seems fairly high availability, but anyway this should not compromise
>> >> > the cluster's stability.
>> >> > All the queries complete in under a second, so it is responsive.
>> >> >
>> >> > A few months ago I began to think the problem was related to an old and
>> >> > buggy version of SolrCloud that we have to upgrade.
>> >> > But reading in this list about the classic XY problem I changed my
>> >> > mind; maybe there is a much better solution.
>> >> >
>> >> > Last night I again had a couple of replicas down, around 1:07 AM; this
>> >> > is the SolrCloud log file:
>> >> >
>> >> > http://pastebin.com/raw.php?i=bCHnqnXD
>> >> >
>> >> > At the end of the exception list there are a few "cancelElection did
>> >> > not find election node to remove" errors, and this morning I found the
>> >> > replicas down.
>> >> >
>> >> > Looking at the GC log file I found that at the same moment there is a
>> >> > GC that takes about 20 seconds. Now I'm using the CMS
>> >> > (ConcurrentMarkSweep) collector taken from Shawn Heisey's suggestions:
>> >> >
>> >> > https://wiki.apache.org/solr/ShawnHeisey#CMS_.28ConcurrentMarkSweep.29_Collector
>> >> >
>> >> > http://pastebin.com/raw.php?i=VuSrg4uz
>> >> >
>> >> > Finally, looking around over the last few months I found this bug,
>> >> > which seems to me to be related to this problem.
>> >> > So I began to think that I need an upgrade. Am I right? What do you
>> >> > think about it?
>> >> >
>> >> > https://issues.apache.org/jira/browse/SOLR-6159
>> >> >
>> >> > Any help is very much appreciated.
>> >> >
>> >> > Thanks,
>> >> > Vincenzo
>> >>
>> >
>> >
>> > --
>> > Vincenzo D'Amore
>> > email: v.dam...@gmail.com
>> > skype: free.dev
>> > mobile: +39 349 8513251
>>
>
>
> --
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251
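For reference, below is a minimal sketch of the kind of batched, multi-threaded SolrJ indexer Vincenzo describes (2 threads, ~1.5M small documents, one commit at the end), assuming SolrJ 5.x. The ZooKeeper hosts, collection name, field names, and batch size are placeholders for illustration, not values taken from the thread; the actual application reads from files, Solr, or an RDBMS rather than generating documents.

```java
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BatchIndexer {

    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and collection name.
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("collection1");

        // Keep the indexing thread count low: the thread reports that 2
        // threads reindex 1.5M docs in about an hour, while too many
        // concurrent threads made the servers unresponsive during the day.
        ExecutorService pool = Executors.newFixedThreadPool(2);

        for (int t = 0; t < 2; t++) {
            final int offset = t;
            pool.submit(() -> {
                List<SolrInputDocument> batch = new ArrayList<>();
                for (int i = offset; i < 1_500_000; i += 2) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", Integer.toString(i));
                    doc.addField("text", "small document body " + i);
                    batch.add(doc);

                    // Send documents in batches instead of one at a time.
                    if (batch.size() >= 1000) {
                        client.add(batch);
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    client.add(batch);
                }
                return null;
            });
        }

        pool.shutdown();
        pool.awaitTermination(2, TimeUnit.HOURS);

        // A single explicit commit at the end of the full reindex, rather
        // than committing per batch, keeps searcher churn (and GC pressure)
        // on the nodes lower while the bulk load is running.
        client.commit();
        client.close();
    }
}
```

The design choice worth noting is the combination of batched adds and a single final commit: it spreads load without the per-document request overhead, and it avoids the frequent commits that can amplify GC activity on already-busy nodes.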